1 Introduction


Computer science as an academic discipline began in the 1960s. Emphasis was on
programming languages, compilers, operating systems, and the mathematical theory that
supported these areas. Courses in theoretical computer science covered finite automata,
regular expressions, context-free languages, and computability. In the 1970s, the study
of algorithms was added as an important component of theory. The emphasis was on 
making computers useful. Today, a fundamental change is taking place and the focus is 
more on a wealth of applications. There are many reasons for this change. The merging 
of computing and communications has played an important role. The enhanced ability 
to observe, collect, and store data in the natural sciences, in commerce, and in other 
fields calls for a change in our understanding of data and how to handle it in the modern 
setting. The emergence of the web and social networks as central aspects of daily life 
presents both opportunities and challenges for theory. 


While traditional areas of computer science remain highly important, increasingly
researchers of the future will be involved with using computers to understand and extract
usable information from massive data arising in applications, not just how to make
computers useful on specific well-defined problems. With this in mind we have written this
book to cover the theory we expect to be useful in the next 40 years, just as an
understanding of automata theory, algorithms, and related topics gave students an advantage
in the last 40 years. One of the major changes is an increase in emphasis on probability,
statistics, and numerical methods.


Early drafts of the book have been used for both undergraduate and graduate courses. 
Background material needed for an undergraduate course has been put in the appendix. 
For this reason, the appendix has homework problems. 


Modern data in diverse fields such as information processing, search, and machine
learning is often advantageously represented as vectors with a large number of
components. The vector representation is not just a book-keeping device to store many fields
of a record. Indeed, the two salient aspects of vectors, geometric (length, dot products,
orthogonality, etc.) and linear algebraic (independence, rank, singular values, etc.), turn
out to be relevant and useful. Chapters 2 and 3 lay the foundations of geometry and
linear algebra respectively. More specifically, our intuition from two or three dimensional
space can be surprisingly off the mark when it comes to high dimensions. Chapter 2
works out the fundamentals needed to understand the differences. The emphasis of the
chapter, as well as the book in general, is to get across the intellectual ideas and the
mathematical foundations rather than focus on particular applications, some of which are
briefly described. Chapter 3 focuses on singular value decomposition (SVD), a central tool
to deal with matrix data. We give a from-first-principles description of the mathematics
and algorithms for SVD. Applications of singular value decomposition include principal
component analysis, a widely used technique which we touch upon, as well as modern
applications to statistical mixtures of probability densities, discrete optimization, etc.,
which are described in more detail.


Exploring large structures like the web or the space of configurations of a large system
with deterministic methods can be prohibitively expensive. Random walks (also called
Markov chains) often turn out to be more efficient as well as illuminating. The
stationary distributions of such walks are important for applications ranging from web search to
the simulation of physical systems. The underlying mathematical theory of such random
walks, as well as connections to electrical networks, forms the core of Chapter 4 on Markov
chains.


One of the surprises of computer science over the last two decades is that some
domain-independent methods have been immensely successful in tackling problems from diverse
areas. Machine learning is a striking example. Chapter 5 describes the foundations
of machine learning, both algorithms for optimizing over given training examples, as 
well as the theory for understanding when such optimization can be expected to lead to 
good performance on new, unseen data. This includes important measures such as the 
Vapnik-Chervonenkis dimension, important algorithms such as the Perceptron Algorithm, 
stochastic gradient descent, boosting, and deep learning, and important notions such as 
regularization and overfitting. 


The field of algorithms has traditionally assumed that the input data to a problem is 
presented in random access memory, which the algorithm can repeatedly access. This is 
not feasible for problems involving enormous amounts of data. The streaming model and 
other models have been formulated to reflect this. In this setting, sampling plays a crucial 
role and, indeed, we have to sample on the fly. In Chapter 6 we study how to draw good 
samples efficiently and how to estimate statistical and linear algebra quantities with such
samples.


While Chapter 5 focuses on supervised learning, where one learns from labeled training 
data, the problem of unsupervised learning, or learning from unlabeled data, is equally 
important. A central topic in unsupervised learning is clustering, discussed in Chapter 
7. Clustering refers to the problem of partitioning data into groups of similar objects. 
After describing some of the basic methods for clustering, such as the k-means algorithm, 
Chapter 7 focuses on modern developments in understanding these, as well as newer
algorithms and general frameworks for analyzing different kinds of clustering problems.


Central to our understanding of large structures, like the web and social networks, is 
building models to capture essential properties of these structures. The simplest model 
is that of a random graph formulated by Erdős and Rényi, which we study in detail in
Chapter 8, proving that certain global phenomena, like a giant connected component, 
arise in such structures with only local choices. We also describe other models of random 
graphs. 


Chapter 9 focuses on linear-algebraic problems of making sense from data, in
particular topic modeling and non-negative matrix factorization. In addition to discussing
well-known models, we also describe some current research on models and algorithms with 
provable guarantees on learning error and time. This is followed by graphical models and 
belief propagation. 


Chapter 10 discusses ranking and social choice as well as problems of sparse
representations such as compressed sensing. Additionally, Chapter 10 includes a brief discussion
of linear programming and semidefinite programming. Wavelets, which are an
important method for representing signals across a wide range of applications, are discussed in
Chapter 11 along with some of their fundamental mathematical properties. The appendix
includes a range of background material. 


A word about notation in the book. To help the student, we have adopted certain
notations, and with a few exceptions, adhered to them. We use lower case letters for
scalar variables and functions, bold face lower case for vectors, and upper case letters
for matrices. Lower case letters near the beginning of the alphabet tend to be constants; those in the
middle of the alphabet, such as i, j, and k, are indices in summations; n and m are used for integer
sizes; and x, y, and z for variables. If A is a matrix, its elements are $a_{ij}$ and its rows are $\mathbf{a}_i$.
If $\mathbf{a}_i$ is a vector, its coordinates are $a_{ij}$. Where the literature traditionally uses a symbol
for a quantity, we also used that symbol, even if it meant abandoning our convention. If
we have a set of points in some vector space, and work with a subspace, we use n for the
number of points, d for the dimension of the space, and k for the dimension of the subspace.


The term “almost surely” means with probability tending to one. We use $\ln n$ for the
natural logarithm and $\log n$ for the base two logarithm. If we want base ten, we will use
$\log_{10}$. To simplify notation and to make it easier to read we use $E^2(1-x)$ for $\big(E(1-x)\big)^2$
and $E(1-x)^2$ for $E\big((1-x)^2\big)$. When we say “randomly select” some number of points
from a given probability distribution, independence is always assumed unless otherwise
stated.


2 High-Dimensional Space


2.1 Introduction 


High dimensional data has become very important. However, high dimensional space 
is very different from the two and three dimensional spaces we are familiar with. Generate 
n points at random in d-dimensions where each coordinate is a zero mean, unit variance 
Gaussian. For sufficiently large d, with high probability the distances between all pairs 
of points will be essentially the same. Also the volume of the unit ball in d-dimensions, 
the set of all points x such that $|\mathbf{x}| \le 1$, goes to zero as the dimension goes to infinity.
The volume of a high dimensional unit ball is concentrated near its surface and is also 
concentrated at its equator. These properties have important consequences which we will 
consider. 


2.2 The Law of Large Numbers 


If one generates random points in d-dimensional space using a Gaussian to generate 
coordinates, the distance between all pairs of points will be essentially the same when d 
is large. The reason is that the square of the distance between two points y and z,

$$|\mathbf{y} - \mathbf{z}|^2 = \sum_{i=1}^{d} (y_i - z_i)^2,$$

can be viewed as the sum of d independent samples of a random variable x that is the
squared difference of two Gaussians. In particular, we are summing independent samples
$x_i = (y_i - z_i)^2$ of a random variable x of bounded variance. In such a case, a general bound
known as the Law of Large Numbers states that with high probability, the average of the 
samples will be close to the expectation of the random variable. This in turn implies that 
with high probability, the sum is close to the sum’s expectation. 
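
The concentration of pairwise distances can be observed numerically. The following is a minimal sketch, assuming NumPy is available; the function name `pairwise_distance_spread`, the sample size, and the seed are illustrative choices and not part of the text. It draws n standard Gaussian points in d dimensions and reports how the spread of the pairwise distances, relative to their mean, shrinks as d grows.

```python
import numpy as np

def pairwise_distance_spread(n, d, seed=0):
    """Return (mean, std) of the pairwise distances among n standard Gaussian points in R^d."""
    rng = np.random.default_rng(seed)
    points = rng.standard_normal((n, d))
    # Squared distances via |y - z|^2 = |y|^2 + |z|^2 - 2 y.z
    sq_norms = np.sum(points**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    upper = np.triu_indices(n, k=1)          # each unordered pair once
    dists = np.sqrt(np.maximum(sq_dists[upper], 0.0))
    return dists.mean(), dists.std()

for d in (2, 100, 10000):
    mean, std = pairwise_distance_spread(n=50, d=d)
    print(f"d={d:6d}  mean distance={mean:8.3f}  relative spread={std / mean:.4f}")
```

For unit-variance coordinates the expected squared distance is 2d, so the mean distance should be close to the square root of 2d once d is large.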


Specifically, the Law of Large Numbers states that

$$\mathrm{Prob}\left(\left|\frac{x_1 + x_2 + \cdots + x_n}{n} - E(x)\right| \ge \epsilon\right) \le \frac{\mathrm{Var}(x)}{n\epsilon^2}. \qquad (2.1)$$

The larger the variance of the random variable, the greater the probability that the error
will exceed $\epsilon$. Thus the variance of x is in the numerator. The number of samples n is in
the denominator since the more values that are averaged, the smaller the probability that
the difference will exceed $\epsilon$. Similarly the larger $\epsilon$ is, the smaller the probability that the
difference will exceed $\epsilon$ and hence $\epsilon$ is in the denominator. Notice that squaring $\epsilon$ makes
the fraction a dimensionless quantity.


We use two inequalities to prove the Law of Large Numbers. The first is Markov’s 
inequality that states that the probability that a non-negative random variable exceeds a 
is bounded by the expected value of the variable divided by a. 


Theorem 2.1 (Markov's inequality) Let x be a non-negative random variable. Then
for a > 0,

$$\mathrm{Prob}(x \ge a) \le \frac{E(x)}{a}.$$





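Proof: For a continuous non-negative random variable x with probability density p,

$$E(x) = \int_0^\infty x\,p(x)\,dx = \int_0^a x\,p(x)\,dx + \int_a^\infty x\,p(x)\,dx \ge \int_a^\infty x\,p(x)\,dx \ge a\int_a^\infty p(x)\,dx = a\,\mathrm{Prob}(x \ge a).$$

Thus, $\mathrm{Prob}(x \ge a) \le \frac{E(x)}{a}$. ∎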


The same proof works for discrete random variables with sums instead of integrals. 


Corollary 2.2 $\mathrm{Prob}\big(x \ge b\,E(x)\big) \le \frac{1}{b}$


Markov’s inequality bounds the tail of a distribution using only information about the 
mean. A tighter bound can be obtained by also using the variance of the random variable. 


Theorem 2.3 (Chebyshev’s inequality) Let x be a random variable. Then for c > 0, 


$$\mathrm{Prob}\big(|x - E(x)| \ge c\big) \le \frac{\mathrm{Var}(x)}{c^2}.$$


Proof: $\mathrm{Prob}(|x - E(x)| \ge c) = \mathrm{Prob}(|x - E(x)|^2 \ge c^2)$. Note that $y = |x - E(x)|^2$ is a
non-negative random variable and $E(y) = \mathrm{Var}(x)$, so Markov’s inequality can be applied
giving:

$$\mathrm{Prob}(|x - E(x)| \ge c) = \mathrm{Prob}\big(|x - E(x)|^2 \ge c^2\big) \le \frac{E\big(|x - E(x)|^2\big)}{c^2} = \frac{\mathrm{Var}(x)}{c^2}.$$

∎
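
Both inequalities can be checked exactly on a small example. The sketch below uses a fair six-sided die as the random variable; the die and the helper names `tail` and `deviation` are illustrative assumptions, not taken from the text. It uses exact rational arithmetic from the Python standard library, so the comparisons are not affected by rounding.

```python
from fractions import Fraction

# Fair six-sided die: a small non-negative random variable to test against.
values = range(1, 7)
prob = Fraction(1, 6)

mean = sum(v * prob for v in values)                    # E(x)
variance = sum((v - mean) ** 2 * prob for v in values)  # Var(x)

def tail(a):
    """Exact Prob(x >= a)."""
    return sum(prob for v in values if v >= a)

def deviation(c):
    """Exact Prob(|x - E(x)| >= c)."""
    return sum(prob for v in values if abs(v - mean) >= c)

# Markov: Prob(x >= a) <= E(x) / a
for a in (4, 5, 6):
    print(f"Prob(x >= {a}) = {tail(a)}, Markov bound = {mean / a}")

# Chebyshev: Prob(|x - E(x)| >= c) <= Var(x) / c^2
for c in (Fraction(3, 2), Fraction(5, 2)):
    print(f"Prob(|x - E(x)| >= {c}) = {deviation(c)}, Chebyshev bound = {variance / c**2}")
```

The bounds hold in every case but are far from tight, which is expected since they use only the mean and the variance of the distribution.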


The Law of Large Numbers follows from Chebyshev’s inequality together with facts 
about independent random variables. Recall that: 
$$\begin{aligned}
E(x + y) &= E(x) + E(y),\\
\mathrm{Var}(x - c) &= \mathrm{Var}(x),\\
\mathrm{Var}(cx) &= c^2\,\mathrm{Var}(x).
\end{aligned}$$

Also, if x and y are independent, then $E(xy) = E(x)E(y)$. These facts imply that if x
and y are independent then $\mathrm{Var}(x + y) = \mathrm{Var}(x) + \mathrm{Var}(y)$, which is seen as follows:

$$\begin{aligned}
\mathrm{Var}(x + y) &= E\big((x + y)^2\big) - E^2(x + y)\\
&= E(x^2 + 2xy + y^2) - \big(E^2(x) + 2E(x)E(y) + E^2(y)\big)\\
&= E(x^2) - E^2(x) + E(y^2) - E^2(y) = \mathrm{Var}(x) + \mathrm{Var}(y),
\end{aligned}$$

where we used independence to replace $E(2xy)$ with $2E(x)E(y)$.


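Theorem 2.4 (Law of Large Numbers) Let $x_1, x_2, \ldots, x_n$ be n independent samples
of a random variable x. Then

$$\mathrm{Prob}\left(\left|\frac{x_1 + x_2 + \cdots + x_n}{n} - E(x)\right| \ge \epsilon\right) \le \frac{\mathrm{Var}(x)}{n\epsilon^2}.$$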


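Proof: $E\left(\frac{x_1 + x_2 + \cdots + x_n}{n}\right) = E(x)$ and thus

$$\mathrm{Prob}\left(\left|\frac{x_1 + x_2 + \cdots + x_n}{n} - E(x)\right| \ge \epsilon\right) = \mathrm{Prob}\left(\left|\frac{x_1 + x_2 + \cdots + x_n}{n} - E\left(\frac{x_1 + x_2 + \cdots + x_n}{n}\right)\right| \ge \epsilon\right).$$

By Chebyshev’s inequality

$$\begin{aligned}
\mathrm{Prob}\left(\left|\frac{x_1 + x_2 + \cdots + x_n}{n} - E(x)\right| \ge \epsilon\right)
&= \mathrm{Prob}\left(\left|\frac{x_1 + x_2 + \cdots + x_n}{n} - E\left(\frac{x_1 + x_2 + \cdots + x_n}{n}\right)\right| \ge \epsilon\right)\\
&\le \frac{\mathrm{Var}\left(\frac{x_1 + x_2 + \cdots + x_n}{n}\right)}{\epsilon^2}\\
&= \frac{1}{n^2\epsilon^2}\,\mathrm{Var}(x_1 + x_2 + \cdots + x_n)\\
&= \frac{1}{n^2\epsilon^2}\big(\mathrm{Var}(x_1) + \mathrm{Var}(x_2) + \cdots + \mathrm{Var}(x_n)\big)\\
&= \frac{\mathrm{Var}(x)}{n\epsilon^2}.
\end{aligned}$$

∎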





The Law of Large Numbers is quite general, applying to any random variable x of 
finite variance. Later we will look at tighter concentration bounds for spherical Gaussians 
and sums of 0-1 valued random variables. 


One observation worth making about the Law of Large Numbers is that the size of the 
universe does not enter into the bound. For instance, if you want to know what fraction 
of the population of a country prefers tea to coffee, then the number n of people you need 
to sample in order to have at most a $\delta$ chance that your estimate is off by more than $\epsilon$
depends only on $\epsilon$ and $\delta$ and not on the population of the country.
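
The sample size implied by the bound can be computed directly by setting $\mathrm{Var}(x)/(n\epsilon^2) \le \delta$ and solving for n. The sketch below is illustrative only: the function names, the preference rate 0.6, and the number of simulated polls are assumptions, and the simulation draws independent answers (equivalent to sampling with replacement) using NumPy. For a yes/no answer the variance is at most 1/4, which is the default used here.

```python
import math
import numpy as np

def samples_needed(eps, delta, variance=0.25):
    """An n with variance / (n * eps^2) <= delta, from the Law of Large Numbers bound.

    variance=0.25 is the worst case for a yes/no (0-1 valued) answer.
    """
    return math.ceil(variance / (eps**2 * delta))

def empirical_failure_rate(p, n, eps, trials=2000, seed=0):
    """Fraction of simulated polls of size n whose estimate misses p by more than eps."""
    rng = np.random.default_rng(seed)
    estimates = rng.binomial(n, p, size=trials) / n
    return float(np.mean(np.abs(estimates - p) > eps))

eps, delta = 0.05, 0.1
n = samples_needed(eps, delta)
print("samples needed:", n)
print("empirical failure rate:", empirical_failure_rate(p=0.6, n=n, eps=eps))
```

The population size never appears in `samples_needed`, which mirrors the observation above. The observed failure rate is typically much smaller than $\delta$ because the Chebyshev-based bound is loose.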


As an application of the Law of Large Numbers, let z be a d-dimensional random point
whose coordinates are each selected from a zero mean, $\frac{1}{2\pi}$ variance Gaussian. We set the
variance to $\frac{1}{2\pi}$ so the Gaussian probability density equals one at the origin and is bounded
below throughout the unit ball by a constant.¹ By the Law of Large Numbers, the square
of the distance of z to the origin will be $\Theta(d)$ with high probability. In particular, there is
vanishingly small probability that such a random point z would lie in the unit ball. This 
implies that the integral of the probability density over the unit ball must be vanishingly 
small. On the other hand, the probability density in the unit ball is bounded below by a 
constant. We thus conclude that the unit ball must have vanishingly small volume. 


Similarly if we draw two points y and z from a d-dimensional Gaussian with unit 
variance in each direction, then $|\mathbf{y}|^2 \approx d$ and $|\mathbf{z}|^2 \approx d$. Since for all i,

$$E(y_i - z_i)^2 = E(y_i^2) + E(z_i^2) - 2E(y_i z_i) = \mathrm{Var}(y_i) + \mathrm{Var}(z_i) - 2E(y_i)E(z_i) = 2,$$

$|\mathbf{y} - \mathbf{z}|^2 = \sum_{i=1}^{d} (y_i - z_i)^2 \approx 2d$. Thus by the Pythagorean theorem, the random d-dimensional
y and z must be approximately orthogonal. This implies that if we scale these random
points to be unit length and call y the North Pole, much of the surface area of the unit ball
must lie near the equator. We will formalize these and related arguments in subsequent
sections.


We now state a general theorem on probability tail bounds for a sum of indepen- 
dent random variables. Tail bounds for sums of Bernoulli, squared Gaussian and Power 
Law distributed random variables can all be derived from this. The table in Figure 2.1 
summarizes some of the results. 


Theorem 2.5 (Master Tail Bounds Theorem) Let $x = x_1 + x_2 + \cdots + x_n$, where
$x_1, x_2, \ldots, x_n$ are mutually independent random variables with zero mean and variance at
most $\sigma^2$. Let $0 \le a \le \sqrt{2}\,n\sigma^2$. Assume that $|E(x_i^s)| \le \sigma^2 s!$ for $s = 3, 4, \ldots, \lfloor a^2/(4n\sigma^2) \rfloor$.
Then,

$$\mathrm{Prob}(|x| \ge a) \le 3e^{-\frac{a^2}{12n\sigma^2}}.$$


The proof of Theorem 2.5 is elementary. A slightly more general version, Theorem 12.5, 
is given in the appendix. For a brief intuition of the proof, consider applying Markov’s 
inequality to the random variable $x^r$ where r is a large even number. Since r is even, $x^r$
is non-negative, and thus $\mathrm{Prob}(|x| \ge a) = \mathrm{Prob}(x^r \ge a^r) \le E(x^r)/a^r$. If $E(x^r)$ is not
too large, we will get a good bound. To compute $E(x^r)$, write $E(x^r)$ as $E(x_1 + \cdots + x_n)^r$
and expand the polynomial into a sum of terms. Use the fact that by independence
$E(x_i^{r_i} x_j^{r_j}) = E(x_i^{r_i})E(x_j^{r_j})$ to get a collection of simpler expectations that can be bounded
using our assumption that $|E(x_i^s)| \le \sigma^2 s!$. For the full proof, see the appendix.





¹If we instead used variance 1, then the density at the origin would be a decreasing function of d,
namely $\left(\frac{1}{2\pi}\right)^{d/2}$, making this argument more complicated.


                   Condition                                                        Tail bound

Markov             $x \ge 0$                                                        $\mathrm{Prob}(x \ge a) \le \frac{E(x)}{a}$

Chebyshev          Any x                                                            $\mathrm{Prob}(|x - E(x)| \ge a) \le \frac{\mathrm{Var}(x)}{a^2}$

Chernoff           $x = x_1 + x_2 + \cdots + x_n$;                                  $\mathrm{Prob}(|x - E(x)| \ge \epsilon E(x)) \le 3e^{-c\epsilon^2 E(x)}$
                   $x_i \in [0,1]$ i.i.d. Bernoulli

Higher Moments     r positive even integer                                          $\mathrm{Prob}(|x| \ge a) \le E(x^r)/a^r$

Gaussian Annulus   $x = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$;                     $\mathrm{Prob}(|x - \sqrt{n}| \ge \beta) \le 3e^{-c\beta^2}$
                   $x_i \sim N(0,1)$ indep.; $\beta \le \sqrt{n}$

Power Law          $x = x_1 + x_2 + \cdots + x_n$;                                  $\mathrm{Prob}(|x - E(x)| \ge \epsilon E(x)) \le \left(4/(\epsilon^2 k n)\right)^{(k-3)/2}$
for $x_i$;         $x_i$ i.i.d.; $\epsilon \le 1/k^2$
order $k \ge 4$


Figure 2.1: Table of Tail Bounds. The Higher Moments bound is obtained by applying
Markov to $x^r$. The Chernoff, Gaussian Annulus, and Power Law bounds follow from
Theorem 2.5 which is proved in the appendix.


2.3 The Geometry of High Dimensions 


An important property of high-dimensional objects is that most of their volume is 
near the surface. Consider any object A in $R^d$. Now shrink A by a small amount $\epsilon$ to
produce a new object $(1 - \epsilon)A = \{(1 - \epsilon)x \mid x \in A\}$. Then the following equality holds:

$$\mathrm{volume}\big((1 - \epsilon)A\big) = (1 - \epsilon)^d\,\mathrm{volume}(A).$$

To see that this is true, partition A into infinitesimal cubes. Then, $(1 - \epsilon)A$ is the union
of a set of cubes obtained by shrinking the cubes in A by a factor of $1 - \epsilon$. When we
shrink each of the 2d sides of a d-dimensional cube by a factor f, its volume shrinks by a
factor of $f^d$. Using the fact that $1 - x \le e^{-x}$, for any object A in $R^d$ we have:

$$\frac{\mathrm{volume}\big((1 - \epsilon)A\big)}{\mathrm{volume}(A)} = (1 - \epsilon)^d \le e^{-\epsilon d}.$$

Fixing $\epsilon$ and letting $d \to \infty$, the above quantity rapidly approaches zero. This means
that nearly all of the volume of A must be in the portion of A that does not belong to
the region $(1 - \epsilon)A$.
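
A few numbers make the speed of this decay concrete. The sketch below is illustrative; the function name `inner_fraction` and the choice of 0.01 for the shrink amount are assumptions. It compares the exact fraction $(1-\epsilon)^d$ with the upper bound $e^{-\epsilon d}$ for increasing d.

```python
import math

def inner_fraction(eps, d):
    """Fraction of the volume of A that lies in the shrunken copy (1 - eps) A."""
    return (1.0 - eps) ** d

eps = 0.01
for d in (1, 10, 100, 1000, 10000):
    exact = inner_fraction(eps, d)
    bound = math.exp(-eps * d)
    print(f"d={d:6d}  (1-eps)^d={exact:.3e}  e^(-eps d)={bound:.3e}")
```

Even with a shrink of only one percent, almost none of the volume remains in the inner copy once d reaches the thousands.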


Let S denote the unit ball in d dimensions, that is, the set of points within distance
one of the origin. An immediate implication of the above observation is that at least a
$1 - e^{-\epsilon d}$ fraction of the volume of the unit ball is concentrated in $S \setminus (1 - \epsilon)S$, namely
in a small annulus of width $\epsilon$ at the boundary. In particular, most of the volume of the
d-dimensional unit ball is contained in an annulus of width O(1/d) near the boundary. If
the ball is of radius r, then the annulus width is $O\left(\frac{r}{d}\right)$.


Figure 2.2: Most of the volume of the d-dimensional ball of radius r is contained in an
annulus of width O(r/d) near the boundary.


2.4 Properties of the Unit Ball 


We now focus more specifically on properties of the unit ball in d-dimensional space. 
We just saw that most of its volume is concentrated in a small annulus of width O(1/d) 
near the boundary. Next we will show that in the limit as d goes to infinity, the volume of 
the ball goes to zero. This result can be proven in several ways. Here we use integration. 


2.4.1 Volume of the Unit Ball 


To calculate the volume V(d) of the unit ball in $R^d$, one can integrate in either Cartesian
or polar coordinates. In Cartesian coordinates the volume is given by

$$V(d) = \int_{x_1=-1}^{x_1=1} \int_{x_2=-\sqrt{1-x_1^2}}^{x_2=\sqrt{1-x_1^2}} \cdots \int_{x_d=-\sqrt{1-x_1^2-\cdots-x_{d-1}^2}}^{x_d=\sqrt{1-x_1^2-\cdots-x_{d-1}^2}} dx_d \cdots dx_2\,dx_1.$$


Since the limits of the integrals are complicated, it is easier to integrate using polar 
coordinates. In polar coordinates, V(d) is given by 


$$V(d) = \int_{S^d} \int_{r=0}^{1} r^{d-1}\,dr\,d\Omega.$$

Since the variables $\Omega$ and r do not interact,

$$V(d) = \int_{S^d} d\Omega \int_{r=0}^{1} r^{d-1}\,dr = \frac{1}{d}\int_{S^d} d\Omega = \frac{A(d)}{d}$$

where A(d) is the surface area of the d-dimensional unit ball. For instance, for d = 3 the
surface area is $4\pi$ and the volume is $\frac{4}{3}\pi$. The question remains, how to determine the
surface area $A(d) = \int_{S^d} d\Omega$ for general d.
ga 


Consider a different integral 


$$I(d) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} e^{-(x_1^2 + x_2^2 + \cdots + x_d^2)}\,dx_d \cdots dx_2\,dx_1.$$

Including the exponential allows integration to infinity rather than stopping at the surface
of the sphere. Thus, I(d) can be computed by integrating in both Cartesian and polar
coordinates. Integrating in polar coordinates will relate I(d) to the surface area A(d).
Equating the two results for I(d) allows one to solve for A(d).


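First, calculate I(d) by integration in Cartesian coordinates.

$$I(d) = \left[\int_{-\infty}^{\infty} e^{-x^2}\,dx\right]^d = \left(\sqrt{\pi}\right)^d = \pi^{\frac{d}{2}}.$$

Here, we have used the fact that $\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi}$. For a proof of this, see Section 12.2
of the appendix. Next, calculate I(d) by integrating in polar coordinates. The volume of
the differential element is $r^{d-1}\,d\Omega\,dr$. Thus,

$$I(d) = \int_{S^d} d\Omega \int_0^{\infty} e^{-r^2} r^{d-1}\,dr.$$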


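The integral $\int_{S^d} d\Omega$ is the integral over the entire solid angle and gives the surface area,
A(d), of a unit sphere. Thus, $I(d) = A(d)\int_0^{\infty} e^{-r^2} r^{d-1}\,dr$. Evaluating the remaining
integral, with the substitution $t = r^2$, gives

$$\int_0^{\infty} e^{-r^2} r^{d-1}\,dr = \int_0^{\infty} e^{-t}\,t^{\frac{d-1}{2}} \left(\tfrac{1}{2}\,t^{-\frac{1}{2}}\,dt\right) = \frac{1}{2}\int_0^{\infty} e^{-t}\,t^{\frac{d}{2}-1}\,dt = \frac{1}{2}\,\Gamma\!\left(\frac{d}{2}\right)$$

and hence, $I(d) = A(d)\,\frac{1}{2}\Gamma\!\left(\frac{d}{2}\right)$ where the Gamma function $\Gamma(x)$ is a generalization of the
factorial function for non-integer values of x. $\Gamma(x) = (x - 1)\Gamma(x - 1)$, $\Gamma(1) = \Gamma(2) = 1$,
and $\Gamma\!\left(\frac{1}{2}\right) = \sqrt{\pi}$. For integer x, $\Gamma(x) = (x - 1)!$.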


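Combining $I(d) = \pi^{\frac{d}{2}}$ with $I(d) = A(d)\,\frac{1}{2}\Gamma\!\left(\frac{d}{2}\right)$ yields

$$A(d) = \frac{\pi^{\frac{d}{2}}}{\frac{1}{2}\Gamma\!\left(\frac{d}{2}\right)},$$

establishing the following lemma.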


Lemma 2.6 The surface area A(d) and the volume V(d) of a unit-radius ball in d
dimensions are given by

$$A(d) = \frac{2\pi^{\frac{d}{2}}}{\Gamma\!\left(\frac{d}{2}\right)} \quad\text{and}\quad V(d) = \frac{2\pi^{\frac{d}{2}}}{d\,\Gamma\!\left(\frac{d}{2}\right)}.$$

To check the formula for the volume of a unit ball, note that $V(2) = \pi$ and
$V(3) = \frac{2}{3}\,\frac{\pi^{3/2}}{\Gamma(3/2)} = \frac{4}{3}\pi$, which are the correct volumes for the unit balls in two and three
dimensions. To check the formula for the surface area of a unit ball, note that $A(2) = 2\pi$ and
$A(3) = \frac{2\pi^{3/2}}{\frac{1}{2}\sqrt{\pi}} = 4\pi$, which are the correct surface areas for the unit ball in two and three
dimensions. Note that $\pi^{\frac{d}{2}}$ is an exponential in $\frac{d}{2}$ and $\Gamma\!\left(\frac{d}{2}\right)$ grows as the factorial of $\frac{d}{2}$.
This implies that $\lim_{d \to \infty} V(d) = 0$, as claimed.
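
The formulas of Lemma 2.6 are easy to evaluate numerically. The following is a small sketch using the Gamma function from the Python standard library; the function names `surface_area` and `volume` are illustrative. It reproduces the checks for d = 2 and d = 3 and shows the volume shrinking toward zero for larger d.

```python
import math

def surface_area(d):
    """A(d) = 2 pi^(d/2) / Gamma(d/2) for the unit ball in d dimensions."""
    return 2.0 * math.pi ** (d / 2) / math.gamma(d / 2)

def volume(d):
    """V(d) = A(d) / d."""
    return surface_area(d) / d

for d in (1, 2, 3, 5, 10, 20, 50):
    print(f"d={d:3d}  A(d)={surface_area(d):.6g}  V(d)={volume(d):.6g}")
```

The factorial growth of the Gamma function in the denominator eventually dominates the exponential growth of the numerator, which is the reason the printed volumes collapse as d increases.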


2.4.2 Volume Near the Equator 


An interesting fact about the unit ball in high dimensions is that most of its volume 
is concentrated near its “equator”. In particular, for any unit-length vector v defining
“north”, most of the volume of the unit ball lies in the thin slab of points whose
dot-product with v has magnitude $O(1/\sqrt{d})$. To show this fact, it suffices by symmetry to fix
v to be the first coordinate vector. That is, we will show that most of the volume of the
unit ball has $|x_1| = O(1/\sqrt{d})$. Using this fact, we will show that two random points in the
unit ball are with high probability nearly orthogonal, and also give an alternative proof
from the one in Section 2.4.1 that the volume of the unit ball goes to zero as $d \to \infty$.


Theorem 2.7 For $c \ge 1$ and $d \ge 3$, at least a $1 - \frac{2}{c}e^{-c^2/2}$ fraction of the volume of the
d-dimensional unit ball has $|x_1| \le \frac{c}{\sqrt{d-1}}$.


Proof: By symmetry we just need to prove that at most a $\frac{2}{c}e^{-c^2/2}$ fraction of the half of
the ball with $x_1 \ge 0$ has $x_1 \ge \frac{c}{\sqrt{d-1}}$. Let A denote the portion of the ball with $x_1 \ge \frac{c}{\sqrt{d-1}}$
and let H denote the upper hemisphere. We will then show that the ratio of the volume
of A to the volume of H goes to zero by calculating an upper bound on volume(A) and
a lower bound on volume(H) and proving that

$$\frac{\mathrm{volume}(A)}{\mathrm{volume}(H)} \le \frac{\text{upper bound volume}(A)}{\text{lower bound volume}(H)} = \frac{2}{c}e^{-\frac{c^2}{2}}.$$

To calculate the volume of A, integrate an incremental volume that is a disk of width
$dx_1$ and whose face is a ball of dimension d − 1 and radius $\sqrt{1 - x_1^2}$. The surface area of
the disk is $(1 - x_1^2)^{\frac{d-1}{2}}\,V(d-1)$ and the volume above the slice is

$$\mathrm{volume}(A) = \int_{\frac{c}{\sqrt{d-1}}}^{1} (1 - x_1^2)^{\frac{d-1}{2}}\,V(d-1)\,dx_1.$$


Figure 2.3: Most of the volume of the upper hemisphere of the d-dimensional ball is
below the plane $x_1 = \frac{c}{\sqrt{d-1}}$.


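To get an upper bound on the above integral, use $1 - x \le e^{-x}$ and integrate to infinity.
To integrate, insert $\frac{x_1\sqrt{d-1}}{c}$, which is greater than one in the range of integration, into the
integral. Then

$$\mathrm{volume}(A) \le \int_{\frac{c}{\sqrt{d-1}}}^{\infty} \frac{x_1\sqrt{d-1}}{c}\,e^{-\frac{d-1}{2}x_1^2}\,V(d-1)\,dx_1 = V(d-1)\,\frac{\sqrt{d-1}}{c}\int_{\frac{c}{\sqrt{d-1}}}^{\infty} x_1 e^{-\frac{d-1}{2}x_1^2}\,dx_1.$$

Now

$$\int_{\frac{c}{\sqrt{d-1}}}^{\infty} x_1 e^{-\frac{d-1}{2}x_1^2}\,dx_1 = -\frac{1}{d-1}\,e^{-\frac{d-1}{2}x_1^2}\,\Big|_{\frac{c}{\sqrt{d-1}}}^{\infty} = \frac{1}{d-1}\,e^{-\frac{c^2}{2}}.$$

Thus, an upper bound on volume(A) is $\frac{V(d-1)}{c\sqrt{d-1}}\,e^{-\frac{c^2}{2}}$.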





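The volume of the hemisphere below the plane $x_1 = \frac{1}{\sqrt{d-1}}$ is a lower bound on the entire
volume of the upper hemisphere and this volume is at least that of a cylinder of height $\frac{1}{\sqrt{d-1}}$
and radius $\sqrt{1 - \frac{1}{d-1}}$. The volume of the cylinder is $V(d-1)\left(1 - \frac{1}{d-1}\right)^{\frac{d-1}{2}}\frac{1}{\sqrt{d-1}}$. Using the
fact that $(1 - x)^a \ge 1 - ax$ for $a \ge 1$, the volume of the cylinder is at least $\frac{V(d-1)}{2\sqrt{d-1}}$ for $d \ge 3$.

Thus,

$$\text{ratio} \le \frac{\text{upper bound above plane}}{\text{lower bound total hemisphere}} = \frac{\frac{V(d-1)}{c\sqrt{d-1}}\,e^{-\frac{c^2}{2}}}{\frac{V(d-1)}{2\sqrt{d-1}}} = \frac{2}{c}\,e^{-\frac{c^2}{2}}.$$

∎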


One might ask why we computed a lower bound on the total hemisphere since it is one 
half of the volume of the unit ball which we already know. The reason is that the volume 
of the upper hemisphere is $\frac{1}{2}V(d)$ and we need a formula with $V(d-1)$ in it to cancel the
$V(d-1)$ in the numerator.
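
Theorem 2.7 can be compared against a direct numerical computation. The sketch below is illustrative and assumes NumPy; the function names and the grid size are arbitrary choices. It uses the same slicing as the proof: the cross-section of the ball at height $x_1$ has volume proportional to $(1 - x_1^2)^{(d-1)/2}$, so the fraction of volume in the slab is a ratio of two one-dimensional integrals, approximated here by Riemann sums on a uniform grid.

```python
import numpy as np

def slab_fraction(d, c, grid=200_001):
    """Fraction of the unit ball's volume in R^d with |x_1| <= c / sqrt(d - 1).

    The cross-section at height x_1 is a (d-1)-dimensional ball of radius
    sqrt(1 - x_1^2), so its volume is proportional to (1 - x_1^2)^((d-1)/2).
    """
    x = np.linspace(-1.0, 1.0, grid)
    density = (1.0 - x**2) ** ((d - 1) / 2)
    inside = np.abs(x) <= c / np.sqrt(d - 1)
    # Uniform grid, so a ratio of sums approximates the ratio of integrals.
    return density[inside].sum() / density.sum()

def theorem_bound(c):
    """Lower bound 1 - (2/c) e^(-c^2/2) from Theorem 2.7."""
    return 1.0 - (2.0 / c) * np.exp(-c**2 / 2)

for d in (3, 10, 100, 1000):
    for c in (1.0, 2.0, 3.0):
        print(f"d={d:5d} c={c}: fraction={slab_fraction(d, c):.4f}  bound={theorem_bound(c):.4f}")
```

For c = 1 the bound is negative and therefore vacuous; it becomes informative as c grows, while the computed fraction always stays above it.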


Near orthogonality. One immediate implication of the above analysis is that if we
draw two points at random from the unit ball, with high probability their vectors will be
nearly orthogonal to each other. Specifically, from our previous analysis in Section 2.3,
with high probability both will be close to the surface and will have length 1 − O(1/d).
From our analysis above, if we define the vector in the direction of the first point as
“north”, with high probability the second will have a projection of only $\pm O(1/\sqrt{d})$ in
this direction, and thus their dot-product will be $\pm O(1/\sqrt{d})$. This implies that with high
probability, the angle between the two vectors will be $\pi/2 \pm O(1/\sqrt{d})$. In particular, we
have the following theorem that states that if we draw n points at random in the unit
ball, with high probability all points will be close to unit length and each pair of points
will be almost orthogonal.


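Theorem 2.8 Consider drawing n points $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ at random from the unit ball.
With probability 1 − O(1/n)

1. $|\mathbf{x}_i| \ge 1 - \frac{2\ln n}{d}$ for all i, and

2. $|\mathbf{x}_i \cdot \mathbf{x}_j| \le \frac{\sqrt{6\ln n}}{\sqrt{d-1}}$ for all $i \ne j$.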


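Proof: For the first part, for any fixed i by the analysis of Section 2.3, the probability
that $|\mathbf{x}_i| < 1 - \epsilon$ is less than $e^{-\epsilon d}$. Thus

$$\mathrm{Prob}\left(|\mathbf{x}_i| < 1 - \frac{2\ln n}{d}\right) \le e^{-\left(\frac{2\ln n}{d}\right)d} = 1/n^2.$$

By the union bound, the probability there exists an i such that $|\mathbf{x}_i| < 1 - \frac{2\ln n}{d}$ is at most
1/n.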


For the second part, Theorem 2.7 states that for a point x drawn at random from the unit ball,
the probability that $|x_1| > \frac{c}{\sqrt{d-1}}$ is at most $\frac{2}{c}e^{-\frac{c^2}{2}}$. There are $\binom{n}{2}$ pairs i and j and for each such
pair if we define $\mathbf{x}_i$ as “north”, the probability that the projection of $\mathbf{x}_j$ onto the “north”
direction is more than $\frac{\sqrt{6\ln n}}{\sqrt{d-1}}$ is at most $O\left(e^{-\frac{6\ln n}{2}}\right) = O(n^{-3})$. Thus, the dot-product
condition is violated with probability at most $O\left(\binom{n}{2}n^{-3}\right) = O(1/n)$ as well. ∎
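
The two conditions of Theorem 2.8 can be checked on simulated data. The sketch below is illustrative only and assumes NumPy; the function names, the seed, and the choice n = 50, d = 1000 are assumptions. Points are drawn uniformly from the unit ball by normalizing a Gaussian vector and rescaling its length, the method described in Section 2.5.

```python
import numpy as np

def sample_unit_ball(n, d, rng):
    """n points uniform in the unit ball: Gaussian direction, radius u^(1/d)."""
    g = rng.standard_normal((n, d))
    directions = g / np.linalg.norm(g, axis=1, keepdims=True)
    radii = rng.random(n) ** (1.0 / d)
    return directions * radii[:, None]

def check_theorem_2_8(n, d, seed=0):
    """Return the fraction of points and the fraction of pairs satisfying the two conditions."""
    rng = np.random.default_rng(seed)
    x = sample_unit_ball(n, d, rng)
    # Condition 1: every point has length at least 1 - 2 ln(n) / d.
    lengths = np.linalg.norm(x, axis=1)
    length_ok = np.mean(lengths >= 1.0 - 2.0 * np.log(n) / d)
    # Condition 2: every pair has |dot product| at most sqrt(6 ln n) / sqrt(d - 1).
    dots = np.abs(x @ x.T)[np.triu_indices(n, k=1)]
    dot_ok = np.mean(dots <= np.sqrt(6.0 * np.log(n)) / np.sqrt(d - 1))
    return float(length_ok), float(dot_ok)

print(check_theorem_2_8(n=50, d=1000))
```

Both returned fractions should be at or very near 1, since the theorem allows failure only with probability O(1/n).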


Alternative proof that volume goes to zero. Another immediate implication of
Theorem 2.7 is that as $d \to \infty$, the volume of the ball approaches zero. Specifically,
consider a small box centered at the origin of side length $\frac{2c}{\sqrt{d-1}}$. We show
that for $c = 2\sqrt{\ln d}$, this box contains over half of the volume of the ball. On the other
hand, the volume of this box clearly goes to zero as d goes to infinity, since its volume is
$O\left(\left(\frac{\ln d}{d-1}\right)^{d/2}\right)$. Thus the volume of the ball goes to zero as well.














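By Theorem 2.7 with $c = 2\sqrt{\ln d}$, the fraction of the volume of the ball with $|x_1| \ge \frac{c}{\sqrt{d-1}}$
is at most:

$$\frac{2}{c}\,e^{-\frac{c^2}{2}} = \frac{1}{\sqrt{\ln d}}\,e^{-2\ln d} = \frac{1}{\sqrt{\ln d}}\,\frac{1}{d^2} < \frac{1}{d^2}.$$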


Figure 2.4: Illustration of the relationship between the sphere and the cube in 2, 4, and
d-dimensions.


Since this is true for each of the d dimensions, by a union bound at most a $O\left(\frac{1}{d}\right) \le \frac{1}{2}$
fraction of the volume of the ball lies outside the cube, completing the proof.


Discussion. One might wonder how it can be that nearly all the points in the unit ball
are very close to the surface and yet at the same time nearly all points are in a box of
side-length $O\left(\sqrt{\frac{\ln d}{d-1}}\right)$. The answer is to remember that points on the surface of the ball
satisfy $x_1^2 + x_2^2 + \cdots + x_d^2 = 1$, so for each coordinate i, a typical value will be $\pm O\left(\frac{1}{\sqrt{d}}\right)$.
In fact, it is often helpful to think of picking a random point on the sphere as very similar
to picking a random point of the form $\left(\pm\frac{1}{\sqrt{d}}, \pm\frac{1}{\sqrt{d}}, \pm\frac{1}{\sqrt{d}}, \ldots, \pm\frac{1}{\sqrt{d}}\right)$.


2.5 Generating Points Uniformly at Random from a Ball


Consider generating points uniformly at random on the surface of the unit ball. For 
the 2-dimensional version of generating points on the circumference of a unit-radius
circle, independently generate each coordinate uniformly at random from the interval [-1, 1].
This produces points distributed over a square that is large enough to completely contain 
the unit circle. Project each point onto the unit circle. The distribution is not uniform 
since more points fall on a line from the origin to a vertex of the square than fall on a line 
from the origin to the midpoint of an edge of the square due to the difference in length. 
To solve this problem, discard all points outside the unit circle and project the remaining 
points onto the circle. 


In higher dimensions, this method does not work since the fraction of points that fall 
inside the ball drops to zero and all of the points would be thrown away. The solution is to 
generate a point each of whose coordinates is an independent Gaussian variable. Generate 
$x_1, x_2, \ldots, x_d$, using a zero mean, unit variance Gaussian, namely, $\frac{1}{\sqrt{2\pi}}\exp(-x^2/2)$ on the
real line.² Thus, the probability density of x is

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{\frac{d}{2}}}\,e^{-\frac{x_1^2 + x_2^2 + \cdots + x_d^2}{2}}$$

and is spherically symmetric. Normalizing the vector $\mathbf{x} = (x_1, x_2, \ldots, x_d)$ to a unit vector,
namely $\frac{\mathbf{x}}{|\mathbf{x}|}$, gives a distribution that is uniform over the surface of the sphere. Note that
once the vector is normalized, its coordinates are no longer statistically independent.


To generate a point y uniformly over the ball (surface and interior), scale the point
$\frac{\mathbf{x}}{|\mathbf{x}|}$ generated on the surface by a scalar $\rho \in [0,1]$. What should the distribution of $\rho$ be
as a function of r? It is certainly not uniform, even in 2 dimensions. Indeed, the density
of $\rho$ at r is proportional to r for d = 2. For d = 3, it is proportional to $r^2$. By similar
reasoning, the density of $\rho$ at distance r is proportional to $r^{d-1}$ in d dimensions. Solving
$\int_{r=0}^{r=1} c\,r^{d-1}\,dr = 1$ (the integral of density must equal 1) one should set c = d. Another
way to see this formally is that the volume of the radius r ball in d dimensions is $r^d V(d)$.
The density at radius r is exactly $\frac{d}{dr}\big(r^d V(d)\big) = d\,r^{d-1}\,V(d)$. So, pick $\rho(r)$ with density equal to
$d\,r^{d-1}$ for r over [0, 1].


We have succeeded in generating a point 


$$\mathbf{y} = \rho\,\frac{\mathbf{x}}{|\mathbf{x}|}$$

uniformly at random from the unit ball by using the convenient spherical Gaussian
distribution. In the next sections, we will analyze the spherical Gaussian in more detail.
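
The procedure of this section can be sketched in a few lines. The code below is one possible realization under stated assumptions, not a prescribed implementation: it assumes NumPy, and the function names `sample_sphere` and `sample_ball` are illustrative. The radius is drawn by inverting its cumulative distribution function: a density of $d\,r^{d-1}$ on [0, 1] has CDF $r^d$, so $\rho = u^{1/d}$ for uniform u.

```python
import numpy as np

def sample_sphere(n, d, rng):
    """n points uniform on the surface of the unit sphere in R^d."""
    x = rng.standard_normal((n, d))          # spherically symmetric Gaussian
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def sample_ball(n, d, rng):
    """n points uniform in the unit ball (surface and interior) in R^d."""
    # The radius has density d r^(d-1) on [0, 1], hence CDF r^d.
    # Inverting the CDF gives rho = u^(1/d) for u uniform on [0, 1).
    rho = rng.random(n) ** (1.0 / d)
    return sample_sphere(n, d, rng) * rho[:, None]

rng = np.random.default_rng(0)
points = sample_ball(100_000, 3, rng)
radii = np.linalg.norm(points, axis=1)
# For a uniform ball, the fraction within radius r should be about r^d.
print("fraction with |y| <= 0.5:", np.mean(radii <= 0.5), "expected about", 0.5**3)
```

Unlike rejection sampling from the enclosing cube, this approach discards no points, so its cost does not grow with the vanishing ratio of ball volume to cube volume.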


2.6 Gaussians in High Dimension 


A 1-dimensional Gaussian has its mass close to the origin. However, as the dimension 
is increased something different happens. The d-dimensional spherical Gaussian with zero 
mean and variance 0? in each coordinate has density function 


1 |x|? 
p(x) = (an) o4 (E) : 


The value of the density is maximum at the origin, but there is very little volume there.
When $\sigma^2 = 1$, integrating the probability density over a unit ball centered at the origin
yields almost zero mass since the volume of such a ball is negligible. In fact, one needs
to increase the radius of the ball to nearly $\sqrt{d}$ before there is a significant volume and
hence significant probability mass. If one increases the radius much beyond $\sqrt{d}$, the
integral barely increases even though the volume increases since the probability density
is dropping off at a much higher rate. The following theorem formally states that nearly
all the probability is concentrated in a thin annulus of width O(1) at radius $\sqrt{d}$.

² One might naturally ask: “how do you generate a random number from a 1-dimensional Gaussian?”
To generate a number from any distribution given its cumulative distribution function P, first select a
uniform random number $u \in [0,1]$ and then choose $x = P^{-1}(u)$. For any a < b, the probability that x is
between a and b is equal to the probability that u is between P(a) and P(b), which equals $P(b) - P(a)$
as desired. For the 2-dimensional Gaussian, one can generate a point in polar coordinates by choosing
angle $\theta$ uniform in $[0, 2\pi]$ and radius $r = \sqrt{-2\ln(u)}$ where u is uniform random in [0,1]. This is called
the Box-Muller transform.
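
The Box-Muller transform mentioned in the footnote can be sketched as follows. The function name `box_muller` is an illustrative choice; the only deviation from the footnote's description is that u is taken in (0, 1] rather than [0, 1] so that the logarithm is always defined.

```python
import math
import random


def box_muller(rng=random):
    """Return two independent zero-mean, unit-variance Gaussian samples."""
    u = 1.0 - rng.random()                 # in (0, 1], so log(u) is defined
    theta = 2.0 * math.pi * rng.random()   # angle uniform in [0, 2*pi)
    r = math.sqrt(-2.0 * math.log(u))      # radius r = sqrt(-2 ln u)
    return r * math.cos(theta), r * math.sin(theta)


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [c for _ in range(10000) for c in box_muller(rng)]
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    print(mean, var)  # should be close to 0 and 1
```

In practice Python's `random.gauss` already provides Gaussian samples; the sketch is only meant to make the polar-coordinate construction concrete.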


Theorem 2.9 (Gaussian Annulus Theorem) For a d-dimensional spherical Gaussian
with unit variance in each direction, for any $\beta \le \sqrt{d}$, all but at most $3e^{-c\beta^2}$ of the probability
mass lies within the annulus $\sqrt{d} - \beta \le |x| \le \sqrt{d} + \beta$, where c is a fixed positive
constant.


For a high-level intuition, note that $E(|x|^2) = \sum_{i=1}^{d} E(x_i^2) = d\,E(x_1^2) = d$, so the mean
squared distance of a point from the center is d. The Gaussian Annulus Theorem says
that the points are tightly concentrated. We call the square root of the mean squared
distance, namely $\sqrt{d}$, the radius of the Gaussian.
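
A small simulation can make the concentration visible. This is only an empirical illustration, not a substitute for the proof; the helper name `gaussian_norms` and the sample sizes are arbitrary choices.

```python
import math
import random


def gaussian_norms(d, samples, rng=random):
    """Return |x| for `samples` independent draws of a d-dimensional unit-variance Gaussian."""
    return [math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))
            for _ in range(samples)]


if __name__ == "__main__":
    rng = random.Random(0)
    for d in (10, 100, 1000):
        norms = gaussian_norms(d, 200, rng)
        worst = max(abs(r - math.sqrt(d)) for r in norms)
        # The largest deviation from sqrt(d) should stay O(1) as d grows.
        print(d, math.sqrt(d), worst)
```

The radius $\sqrt{d}$ grows with the dimension while the observed deviation from it stays of constant order, which is the behavior the theorem describes.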


To prove the Gaussian Annulus Theorem we make use of a tail inequality for sums of 
independent random variables of bounded moments (Theorem 12.5). 


Proof (Gaussian Annulus Theorem): Let $x = (x_1, x_2, \ldots, x_d)$ be a point selected
from a unit variance Gaussian centered at the origin, and let r = |x|. The condition
$\sqrt{d} - \beta \le |x| \le \sqrt{d} + \beta$ is equivalent to $|r - \sqrt{d}| \le \beta$, so the point lies outside the
annulus exactly when $|r - \sqrt{d}| > \beta$. If $|r - \sqrt{d}| \ge \beta$, then multiplying both sides by
$r + \sqrt{d}$ gives $|r^2 - d| \ge \beta(r + \sqrt{d}) \ge \beta\sqrt{d}$. So, it suffices to bound the probability that
$|r^2 - d| \ge \beta\sqrt{d}$.

Rewrite $r^2 - d = (x_1^2 + \cdots + x_d^2) - d = (x_1^2 - 1) + \cdots + (x_d^2 - 1)$ and perform a change
of variables: $y_i = x_i^2 - 1$. We want to bound the probability that $|y_1 + \cdots + y_d| \ge \beta\sqrt{d}$.

Notice that $E(y_i) = E(x_i^2) - 1 = 0$. To apply Theorem 12.5, we need to bound the $s^{th}$
moments of $y_i$.

For $|x_i| \le 1$, $|y_i|^s \le 1$ and for $|x_i| \ge 1$, $|y_i|^s \le |x_i|^{2s}$. Thus

$$|E(y_i^s)| \le E(|y_i|^s) \le E(1 + x_i^{2s}) = 1 + E(x_i^{2s}) = 1 + \sqrt{\frac{2}{\pi}} \int_0^\infty x^{2s} e^{-x^2/2}\, dx.$$

Using the substitution $2z = x^2$,

$$|E(y_i^s)| \le 1 + \frac{1}{\sqrt{\pi}} \int_0^\infty 2^s z^{s - (1/2)} e^{-z}\, dz \le 2^s s!.$$

The last inequality is from the Gamma integral.

Since $E(y_i) = 0$, $Var(y_i) = E(y_i^2) \le 2^2 \cdot 2 = 8$. Unfortunately, we do not have $|E(y_i^s)| \le 8\, s!$
as required in Theorem 12.5. To fix this problem, perform one more change of variables,
using $w_i = y_i/2$. Then, $Var(w_i) \le 2$ and $|E(w_i^s)| \le 2\, s!$, and our goal is now to bound the
probability that $|w_1 + \cdots + w_d| \ge \frac{\beta\sqrt{d}}{2}$. Applying Theorem 12.5 where $\sigma^2 = 2$ and n = d,
this occurs with probability less than or equal to $3e^{-\beta^2/96}$. ∎


In the next sections we will see several uses of the Gaussian Annulus Theorem. 


2.7 Random Projection and Johnson-Lindenstrauss Lemma 


One of the most frequently used subroutines in tasks involving high dimensional data 
is nearest neighbor search. In nearest neighbor search we are given a database of n points 
in $R^d$ where n and d are usually large. The database can be preprocessed and stored in
an efficient data structure. Thereafter, we are presented “query” points in $R^d$ and are
asked to find the nearest or approximately nearest database point to the query point.
Since the number of queries is often large, the time to answer each query should be very
small, ideally a small function of log n and log d, whereas preprocessing time could be
larger, namely a polynomial function of n and d. For this and other problems, dimension
reduction, where one projects the database points to a k-dimensional space with $k \ll d$
(usually dependent on log d) can be very useful so long as the relative distances between 
points are approximately preserved. We will see using the Gaussian Annulus Theorem 
that such a projection indeed exists and is simple. 


The projection $f : R^d \to R^k$ that we will examine (many related projections are
known to work as well) is the following. Pick k Gaussian vectors $u_1, u_2, \ldots, u_k$ in $R^d$
with unit-variance coordinates. For any vector v, define the projection f(v) by:

$$f(v) = (u_1 \cdot v,\; u_2 \cdot v,\; \ldots,\; u_k \cdot v).$$

The projection f(v) is the vector of dot products of v with the $u_i$. We will show that
with high probability, $|f(v)| \approx \sqrt{k}\,|v|$. For any two vectors $v_1$ and $v_2$,
$f(v_1 - v_2) = f(v_1) - f(v_2)$. Thus, to estimate the distance $|v_1 - v_2|$ between two vectors $v_1$ and $v_2$ in
$R^d$, it suffices to compute $|f(v_1) - f(v_2)| = |f(v_1 - v_2)|$ in the k-dimensional space since
the factor of $\sqrt{k}$ is known and one can divide by it. The reason distances increase when
we project to a lower dimensional space is that the vectors $u_i$ are not unit length. Also
notice that the vectors $u_i$ are not orthogonal. If we had required them to be orthogonal,
we would have lost statistical independence.
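
The projection can be sketched directly from its definition. The helper names `make_projection`, `project`, and `norm` are illustrative, and the plain nested lists stand in for whatever matrix representation one would use in practice.

```python
import math
import random


def make_projection(k, d, rng=random):
    """Draw k Gaussian vectors u_1, ..., u_k in R^d with unit-variance coordinates."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]


def project(U, v):
    """Return f(v) = (u_1 . v, ..., u_k . v)."""
    return [sum(a * b for a, b in zip(u, v)) for u in U]


def norm(v):
    """Euclidean length of v."""
    return math.sqrt(sum(c * c for c in v))


if __name__ == "__main__":
    rng = random.Random(0)
    d, k = 1000, 100
    U = make_projection(k, d, rng)
    v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    # |f(v)| / sqrt(k) should be close to |v|.
    print(norm(project(U, v)) / math.sqrt(k), norm(v))
```

Because `project` is linear in v, the same set of vectors `U` must be reused for every point whose distances are to be compared.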


Theorem 2.10 (The Random Projection Theorem) Let v be a fixed vector in $R^d$
and let f be defined as above. There exists constant c > 0 such that for $\varepsilon \in (0,1)$,

$$\text{Prob}\left( \Big|\, |f(v)| - \sqrt{k}\,|v| \,\Big| \ge \varepsilon \sqrt{k}\,|v| \right) \le 3e^{-ck\varepsilon^2},$$

where the probability is taken over the random draws of vectors $u_i$ used to construct f.

Proof: By scaling both sides of the inner inequality by |v|, we may assume that |v| = 1.
The sum of independent normally distributed real variables is also normally distributed
where the mean and variance are the sums of the individual means and variances. Since
$u_i \cdot v = \sum_{j=1}^{d} u_{ij} v_j$, the random variable $u_i \cdot v$ has Gaussian density with zero mean and
unit variance, in particular,

$$Var(u_i \cdot v) = Var\left( \sum_{j=1}^{d} u_{ij} v_j \right) = \sum_{j=1}^{d} v_j^2\, Var(u_{ij}) = \sum_{j=1}^{d} v_j^2 = 1.$$

Since $u_1 \cdot v, u_2 \cdot v, \ldots, u_k \cdot v$ are independent Gaussian random variables, f(v) is a random
vector from a k-dimensional spherical Gaussian with unit variance in each coordinate, and
so the theorem follows from the Gaussian Annulus Theorem (Theorem 2.9) with d replaced
by k and $\beta = \varepsilon\sqrt{k}$. ∎


The random projection theorem establishes that the probability of the length of the 
projection of a single vector differing significantly from its expected value is exponentially 
small in k, the dimension of the target subspace. By a union bound, the probability that 
any of $O(n^2)$ pairwise differences $|v_i - v_j|$ among n vectors $v_1, \ldots, v_n$ differs significantly
from their expected values is small, provided $k \ge \frac{3}{c\varepsilon^2} \ln n$. Thus, this random projection
preserves all relative pairwise distances between points in a set of n points with high 
probability. This is the content of the Johnson-Lindenstrauss Lemma. 


Theorem 2.11 (Johnson-Lindenstrauss Lemma) For any $0 < \varepsilon < 1$ and any integer
n, let $k \ge \frac{3}{c\varepsilon^2} \ln n$ with c as in Theorem 2.9. For any set of n points in $R^d$, the random
projection $f : R^d \to R^k$ defined above has the property that for all pairs of points $v_i$ and
$v_j$, with probability at least $1 - 3/(2n)$,

$$(1 - \varepsilon)\sqrt{k}\,|v_i - v_j| \le |f(v_i) - f(v_j)| \le (1 + \varepsilon)\sqrt{k}\,|v_i - v_j|.$$


Proof: Applying the Random Projection Theorem (Theorem 2.10), for any fixed $v_i$ and
$v_j$, the probability that $|f(v_i - v_j)|$ is outside the range

$$\left[ (1 - \varepsilon)\sqrt{k}\,|v_i - v_j|,\; (1 + \varepsilon)\sqrt{k}\,|v_i - v_j| \right]$$

is at most $3e^{-ck\varepsilon^2} \le 3/n^3$ for $k \ge \frac{3 \ln n}{c\varepsilon^2}$. Since there are $\binom{n}{2} < n^2/2$ pairs of points, by the
union bound, the probability that any pair has a large distortion is less than $\frac{3}{2n}$. ∎


Remark: It is important to note that the conclusion of Theorem 2.11 holds for all $v_i$
and $v_j$, not just for most of them. The weaker assertion for most $v_i$ and $v_j$ is typically less
useful, since our algorithm for a problem such as nearest-neighbor search might return
one of the bad pairs of points. A remarkable aspect of the theorem is that the number
of dimensions in the projection is only dependent logarithmically on n. Since k is often
much less than d, this is called a dimension reduction technique. In applications, the
dominant term is typically the $1/\varepsilon^2$ term.

For the nearest neighbor problem, if the database has $n_1$ points and $n_2$ queries are
expected during the lifetime of the algorithm, take $n = n_1 + n_2$ and project the database
to a random k-dimensional space, for k as in Theorem 2.11. On receiving a query, project
the query to the same subspace and compute nearby database points. The Johnson-Lindenstrauss
Lemma says that with high probability this will yield the right answer
whatever the query. Note that the exponentially small in k probability was useful here in
making k only dependent on ln n, rather than n.
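
The effect of k on the worst pairwise distortion can be observed empirically. The sketch below is an illustration under arbitrary choices of n, d, and k, since the constant c in the theorem is not specified; the function name `max_pairwise_distortion` is hypothetical.

```python
import itertools
import math
import random


def max_pairwise_distortion(points, k, rng=random):
    """Project `points` to k dimensions and return the largest relative error of
    |f(v_i) - f(v_j)| / sqrt(k) against |v_i - v_j| over all pairs."""
    d = len(points[0])
    # k Gaussian vectors with unit-variance coordinates, shared by all points.
    U = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]
    projected = [[sum(a * b for a, b in zip(u, p)) for u in U] for p in points]
    worst = 0.0
    for i, j in itertools.combinations(range(len(points)), 2):
        original = math.dist(points[i], points[j])
        estimate = math.dist(projected[i], projected[j]) / math.sqrt(k)
        worst = max(worst, abs(estimate / original - 1.0))
    return worst


if __name__ == "__main__":
    rng = random.Random(0)
    points = [[rng.gauss(0.0, 1.0) for _ in range(100)] for _ in range(30)]
    for k in (5, 50, 500):
        print(k, max_pairwise_distortion(points, k, rng))
```

The maximum is taken over all pairs, matching the "for all pairs" form of the lemma rather than an average-case statement.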


2.8 Separating Gaussians 


Mixtures of Gaussians are often used to model heterogeneous data coming from multiple 
sources. For example, suppose we are recording the heights of individuals age 20-30 in a 
city. We know that on average, men tend to be taller than women, so a natural model 
would be a Gaussian mixture model $p(x) = w_1 p_1(x) + w_2 p_2(x)$, where $p_1(x)$ is a Gaussian
density representing the typical heights of women, $p_2(x)$ is a Gaussian density representing
the typical heights of men, and $w_1$ and $w_2$ are the mixture weights representing the
proportion of women and men in the city. The parameter estimation problem for a mixture 
model is the problem: given access to samples from the overall density p (e.g., heights of 
people in the city, but without being told whether the person with that height is male 
or female), reconstruct the parameters for the distribution (e.g., good approximations to 
the means and variances of $p_1$ and $p_2$, as well as the mixture weights).


There are taller women and shorter men, so even if one solved the parameter estima- 
tion problem for heights perfectly, given a data point, one couldn’t necessarily tell which 
population it came from. That is, given a height, one couldn’t necessarily tell if it came 
from a man or a woman. In this section, we will look at a problem that is in some ways 
easier and some ways harder than this problem of heights. It will be harder in that we 
will be interested in a mixture of two Gaussians in high-dimensions as opposed to the 
d = 1 case of heights. But it will be easier in that we will assume the means are quite 
well-separated compared to the variances. Specifically, our focus will be on a mixture of 
two spherical unit-variance Gaussians whose means are separated by a distance $\Omega(d^{1/4})$.
We will show that at this level of separation, we can with high probability uniquely de- 
termine which Gaussian each data point came from. The algorithm to do so will actually 
be quite simple. Calculate the distance between all pairs of points. Points whose distance 
apart is smaller are from the same Gaussian, whereas points whose distance is larger are 
from different Gaussians. Later, we will see that with more sophisticated algorithms, even 
a separation of $\Omega(1)$ suffices.


First, consider just one spherical unit-variance Gaussian centered at the origin. From
Theorem 2.9, most of its probability mass lies on an annulus of width O(1) at radius $\sqrt{d}$.
Also $e^{-|x|^2/2} = \prod_i e^{-x_i^2/2}$ and almost all of the mass is within the slab $\{ x \mid -c \le x_1 \le c \}$,
for $c \in O(1)$. Pick a point x from this Gaussian. After picking x, rotate the coordinate
system to make the first axis align with x. Independently pick a second point y from
this Gaussian. The fact that almost all of the probability mass of the Gaussian is within
the slab $\{ x \mid -c \le x_1 \le c,\ c \in O(1) \}$ at the equator implies that y’s component along
x’s direction is O(1) with high probability. Thus, y is nearly perpendicular to x. So,
$|x - y| \approx \sqrt{|x|^2 + |y|^2}$. See Figure 2.5(a). More precisely, since the coordinate system
has been rotated so that x is at the North Pole, $x = (\sqrt{d} \pm O(1), 0, \ldots, 0)$. Since y is
almost on the equator, further rotate the coordinate system so that the component of
y that is perpendicular to the axis of the North Pole is in the second coordinate. Then
$y = (O(1), \sqrt{d} \pm O(1), 0, \ldots, 0)$. Thus,

$$(x - y)^2 = d \pm O(\sqrt{d}) + d \pm O(\sqrt{d}) = 2d \pm O(\sqrt{d})$$

and $|x - y| = \sqrt{2d} \pm O(1)$ with high probability.

Figure 2.5: (a) indicates that two randomly chosen points in high dimension are almost
surely nearly orthogonal. (b) indicates the distance between a pair of random points
from two different unit balls approximating the annuli of two Gaussians.
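
The two claims for a single Gaussian, that independent points are nearly orthogonal and that their distance is close to $\sqrt{2d}$, can be checked numerically. The helper names and the dimension used below are illustrative choices.

```python
import math
import random


def gaussian_point(d, rng=random):
    """Draw one point from a d-dimensional spherical unit-variance Gaussian."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]


def cosine(x, y):
    """Cosine of the angle between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


if __name__ == "__main__":
    rng = random.Random(0)
    d = 10000
    x, y = gaussian_point(d, rng), gaussian_point(d, rng)
    # The distance should be within O(1) of sqrt(2d); the cosine should be near 0.
    print(math.dist(x, y), math.sqrt(2 * d), cosine(x, y))
```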


Consider two spherical unit variance Gaussians with centers p and q separated by a
distance $\Delta$. The distance between a randomly chosen point x from the first Gaussian
and a randomly chosen point y from the second is close to $\sqrt{\Delta^2 + 2d}$, since $x - p$, $p - q$,
and $q - y$ are nearly mutually perpendicular. Pick x and rotate the coordinate system
so that x is at the North Pole. Let z be the North Pole of the ball approximating the
second Gaussian. Now pick y. Most of the mass of the second Gaussian is within O(1)
of the equator perpendicular to $z - q$. Also, most of the mass of each Gaussian is within
distance O(1) of the respective equators perpendicular to the line $q - p$. See Figure 2.5
(b). Thus,

$$|x - y|^2 \approx \Delta^2 + |z - q|^2 + |q - y|^2 = \Delta^2 + 2d \pm O(\sqrt{d}).$$





To ensure that two points picked from the same Gaussian are
closer to each other than two points picked from different Gaussians requires that the
upper limit of the distance between a pair of points from the same Gaussian is at most
the lower limit of distance between points from different Gaussians. This requires that
$\sqrt{2d} + O(1) \le \sqrt{2d + \Delta^2} - O(1)$ or $2d + O(\sqrt{d}) \le 2d + \Delta^2$, which holds when $\Delta \in \omega(d^{1/4})$.
Thus, mixtures of spherical Gaussians can be separated in this way, provided their centers
are separated by $\omega(d^{1/4})$. If we have n points and want to correctly separate all of
them with high probability, we need our individual high-probability statements to hold
with probability $1 - 1/\text{poly}(n)$,³ which means our O(1) terms from Theorem 2.9 become
$O(\sqrt{\log n})$. So we need to include an extra $O(\sqrt{\log n})$ term in the separation distance.


Algorithm for separating points from two Gaussians: Calculate all 
pairwise distances between points. The cluster of smallest pairwise distances 
must come from a single Gaussian. Remove these points. The remaining 
points come from the second Gaussian. 
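
A minimal sketch of this idea follows, under simplifying assumptions that are not in the text: the dimension and the separation `delta` are taken as known, the cutoff is placed midway between the two expected distances, and only distances to one reference point are used rather than all pairwise distances. The function names are hypothetical.

```python
import math
import random


def sample_mixture(n, d, delta, rng=random):
    """Draw n points from an equal-weight mixture of two spherical unit-variance
    Gaussians whose centers are `delta` apart along the first axis."""
    points, labels = [], []
    for _ in range(n):
        label = rng.randrange(2)
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        x[0] += label * delta
        points.append(x)
        labels.append(label)
    return points, labels


def separate(points, delta):
    """Return 0 for points judged to share a Gaussian with points[0], else 1."""
    d = len(points[0])
    # Same-Gaussian pairs are about sqrt(2d) apart, different-Gaussian pairs about
    # sqrt(2d + delta^2). Cutting at the midpoint is an assumption of this sketch.
    cutoff = 0.5 * (math.sqrt(2 * d) + math.sqrt(2 * d + delta ** 2))
    return [0 if math.dist(points[0], p) <= cutoff else 1 for p in points]


if __name__ == "__main__":
    rng = random.Random(0)
    points, labels = sample_mixture(60, 1000, 30.0, rng)
    guess = separate(points, 30.0)
    truth = [0 if lab == labels[0] else 1 for lab in labels]
    print(sum(g != t for g, t in zip(guess, truth)), "points misclassified")
```

With a separation well above $d^{1/4}$ the two groups of distances are far apart relative to their O(1) fluctuations, so the exact placement of the cutoff matters little.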


One can actually separate Gaussians where the centers are much closer. In the next 
chapter we will use singular value decomposition to separate points from a mixture of two 
Gaussians when their centers are separated by a distance O(1). 


2.9 Fitting a Spherical Gaussian to Data 


Given a set of sample points, $x_1, x_2, \ldots, x_n$, in a d-dimensional space, we wish to find
the spherical Gaussian that best fits the points. Let f be the unknown Gaussian with
mean $\mu$ and variance $\sigma^2$ in each direction. The probability density for picking these points
when sampling according to f is given by

$$c \exp\left( -\frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_n - \mu)^2}{2\sigma^2} \right)$$

where the normalizing constant c is the reciprocal of $\left[ \int e^{-\frac{|x - \mu|^2}{2\sigma^2}}\, dx \right]^n$. In integrating from
$-\infty$ to $\infty$, one can shift the origin to $\mu$ and thus c is $\left[ \int e^{-\frac{|x|^2}{2\sigma^2}}\, dx \right]^{-n} = \frac{1}{(2\pi\sigma^2)^{nd/2}}$ and is
independent of $\mu$.


The Maximum Likelihood Estimator (MLE) of f, given the samples $x_1, x_2, \ldots, x_n$, is
the f that maximizes the above probability density. 





Lemma 2.12 Let $\{x_1, x_2, \ldots, x_n\}$ be a set of n d-dimensional points. Then
$(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_n - \mu)^2$ is minimized when $\mu$ is the centroid of the points $x_1, x_2, \ldots, x_n$,
namely $\mu = \frac{1}{n}(x_1 + x_2 + \cdots + x_n)$.


Proof: Setting the gradient of $(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_n - \mu)^2$ with respect to $\mu$
to zero yields

$$-2(x_1 - \mu) - 2(x_2 - \mu) - \cdots - 2(x_n - \mu) = 0.$$

Solving for $\mu$ gives $\mu = \frac{1}{n}(x_1 + x_2 + \cdots + x_n)$. ∎








³ poly(n) means bounded by a polynomial in n.

To determine the maximum likelihood estimate of $\sigma^2$ for f, set $\mu$ to be the true centroid.
Next, show that $\sigma$ is set to the standard deviation of the sample. Substitute $\nu = \frac{1}{2\sigma^2}$ and
$a = (x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_n - \mu)^2$ into the formula for the probability of picking
the points $x_1, x_2, \ldots, x_n$. This gives

$$\frac{e^{-a\nu}}{\left[ \int_x e^{-\nu |x|^2}\, dx \right]^n}.$$

Now, a is fixed and $\nu$ is to be determined. Taking logs, the expression to maximize is

$$-a\nu - n \ln\left[ \int_x e^{-\nu |x|^2}\, dx \right].$$

To find the maximum, differentiate with respect to $\nu$, set the derivative to zero, and solve
for $\sigma$. The derivative is

$$-a + n\, \frac{\int_x |x|^2 e^{-\nu |x|^2}\, dx}{\int_x e^{-\nu |x|^2}\, dx}.$$

Setting $y = \sqrt{\nu}\, x$ in the derivative yields

$$-a + \frac{n}{\nu}\, \frac{\int_y |y|^2 e^{-|y|^2}\, dy}{\int_y e^{-|y|^2}\, dy}.$$

Since the ratio of the two integrals is the expected distance squared of a d-dimensional
spherical Gaussian of standard deviation $\frac{1}{\sqrt{2}}$ to its center, and this is known to be $\frac{d}{2}$, we
get $-a + \frac{nd}{2\nu}$. Substituting $\sigma^2$ for $\frac{1}{2\nu}$ gives $-a + nd\sigma^2$. Setting $-a + nd\sigma^2 = 0$ shows that
the maximum occurs when $\sigma = \frac{\sqrt{a}}{\sqrt{nd}}$. Note that this quantity is the square root of the
average coordinate distance squared of the samples to their mean, which is the standard
deviation of the sample. Thus, we get the following lemma.


Lemma 2.13 The maximum likelihood spherical Gaussian for a set of samples is the 
Gaussian with center equal to the sample mean and standard deviation equal to the stan- 
dard deviation of the sample from the true mean. 
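
The two lemmas translate into a short fitting routine. This is a sketch with an illustrative function name, `fit_spherical_gaussian`, taking the samples as a list of coordinate lists.

```python
import math


def fit_spherical_gaussian(points):
    """Return (mu, sigma) for the maximum likelihood spherical Gaussian."""
    n, d = len(points), len(points[0])
    # Lemma 2.12: the centroid minimizes the sum of squared distances.
    mu = [sum(p[j] for p in points) / n for j in range(d)]
    # a = sum over samples of the squared distance to the centroid.
    a = sum((p[j] - mu[j]) ** 2 for p in points for j in range(d))
    # The likelihood is maximized at sigma = sqrt(a) / sqrt(n d).
    sigma = math.sqrt(a / (n * d))
    return mu, sigma


if __name__ == "__main__":
    corners = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
    print(fit_spherical_gaussian(corners))  # ([1.0, 1.0], 1.0)
```

For the four corners of a square of side 2, the centroid is (1, 1), each point is at squared distance 2 from it, so a = 8 and $\sigma = \sqrt{8/(4 \cdot 2)} = 1$.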


Let $x_1, x_2, \ldots, x_n$ be a sample of points generated by a Gaussian probability distribution.
Then $\mu = \frac{1}{n}(x_1 + x_2 + \cdots + x_n)$ is an unbiased estimator of the expected value
of the distribution. However, if in estimating the variance from the sample set, we use
the estimate of the expected value rather than the true expected value, we will not get
an unbiased estimate of the variance, since the sample mean is not independent of the
sample set. One should divide the sum of squared deviations from the sample mean by
$n - 1$ rather than n when estimating the variance.
See Section 12.4.10 of the appendix.
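
The bias that arises from using the sample mean can be seen in a small simulation. The sketch below, with the hypothetical name `average_variance_estimates`, averages two variance estimates over many small samples from a unit-variance Gaussian: one dividing the squared deviations from the sample mean by n, the other by $n - 1$.

```python
import random


def average_variance_estimates(n, trials, rng=random):
    """Average, over many samples of size n from a unit-variance Gaussian, the variance
    estimates that divide the squared deviations from the sample mean by n and by n - 1."""
    sum_n = sum_n_minus_1 = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(sample) / n
        squared = sum((x - mean) ** 2 for x in sample)
        sum_n += squared / n
        sum_n_minus_1 += squared / (n - 1)
    return sum_n / trials, sum_n_minus_1 / trials


if __name__ == "__main__":
    by_n, by_n_minus_1 = average_variance_estimates(5, 20000, random.Random(0))
    # Expect roughly (n - 1) / n = 0.8 and 1.0 respectively.
    print(by_n, by_n_minus_1)
```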


2.10 Bibliographic Notes


The word vector model was introduced by Salton [SWY75]. There is vast literature on 
the Gaussian distribution, its properties, drawing samples according to it, etc. The reader 
can choose the level and depth according to his/her background. The Master Tail Bounds 
theorem and the derivation of Chernoff and other inequalities from it are from [Kan09]. 
The original proof of the Random Projection Theorem by Johnson and Lindenstrauss was 
complicated. Several authors used Gaussians to simplify the proof. The proof here is due 
to Dasgupta and Gupta [DG99]. See [Vem04] for details and applications of the theorem. 
[MU05] and [MR95b] are textbooks covering much of the material touched upon here.


2.11 Exercises


Exercise 2.1 


1. Let x and y be independent random variables with uniform distribution in [0, 1].
What is the expected value E(x), $E(x^2)$, $E(x - y)$, E(xy), and $E(x - y)^2$?

2. Let x and y be independent random variables with uniform distribution in $[-\frac{1}{2}, \frac{1}{2}]$.
What is the expected value E(x), $E(x^2)$, $E(x - y)$, E(xy), and $E(x - y)^2$?


3. What is the expected squared distance between two points generated at random inside
a unit d-dimensional cube?


Exercise 2.2 Randomly generate 30 points inside the cube $[-\frac{1}{2}, \frac{1}{2}]^{100}$ and plot distance
between points and the angle between the vectors from the origin to the points for all pairs
of points.


Exercise 2.3 Show that for any $a \ge 1$ there exist distributions for which Markov’s inequality
is tight by showing the following:


1. For each a = 2, 3, and 4 give a probability distribution p(x) for a non-negative
random variable x where $\text{Prob}(x \ge a) = \frac{E(x)}{a}$.


2. For arbitrary $a \ge 1$ give a probability distribution for a non-negative random variable
x where $\text{Prob}(x \ge a) = \frac{E(x)}{a}$.


Exercise 2.4 Show that for any $c \ge 1$ there exist distributions for which Chebyshev’s
inequality is tight, in other words, $\text{Prob}(|x - E(x)| \ge c) = Var(x)/c^2$.


Exercise 2.5 Let x be a random variable with probability density $\frac{1}{4}$ for $0 \le x \le 4$ and
zero elsewhere.


1. Use Markov’s inequality to bound the probability that $x \ge 3$.
2. Make use of $\text{Prob}(|x| \ge a) = \text{Prob}(x^2 \ge a^2)$ to get a tighter bound.
3. What is the bound using $\text{Prob}(|x| \ge a) = \text{Prob}(x^r \ge a^r)$?


Exercise 2.6 Consider the probability distribution $p(x = 0) = 1 - \frac{1}{a}$ and $p(x = a) = \frac{1}{a}$.
Plot the probability that x is greater than or equal to a as a function of a for the bound
given by Markov’s inequality and by Markov’s inequality applied to $x^2$ and $x^4$.


Exercise 2.7 Consider the probability density function p(x) = 0 for x < 1 and $p(x) = c\,\frac{1}{x^4}$
for $x \ge 1$.


1. What should c be to make p a legal probability density function? 


2. Generate 100 random samples from this distribution. How close is the average of 
the samples to the expected value of x?


Exercise 2.8 Let U be a set of integers and X and Y be subsets of U where $X \Delta Y$ is
1/10 of U. Prove that the probability that none of n elements selected at random from
U will be in $X \Delta Y$ is less than $e^{-0.1n}$.


Exercise 2.9 Let G be a d-dimensional spherical Gaussian with variance $\frac{1}{2}$ in each direction,
centered at the origin. Derive the expected squared distance to the origin.


Exercise 2.10 Consider drawing a random point x on the surface of the unit sphere in
$R^d$. What is the variance of $x_1$ (the first coordinate of x)? See if you can give an argument
without doing any integrals. 


Exercise 2.11 How large must $\varepsilon$ be for 99% of the volume of a 1000-dimensional unit-radius
ball to lie in the shell of $\varepsilon$-thickness at the surface of the ball?


Exercise 2.12 Prove that $1 + x \le e^x$ for all real x. For what values of x is the approximation
$1 + x \approx e^x$ within 0.01?


Exercise 2.13 For what value of d does the volume, V(d), of a d-dimensional unit ball
take on its maximum? Hint: Consider the ratio $\frac{V(d)}{V(d-1)}$.





Exercise 2.14 A 3-dimensional cube has vertices, edges, and faces. In a d-dimensional 
cube, these components are called faces. A vertex is a 0-dimensional face, an edge a 
1-dimensional face, etc. 


1. For $0 \le k \le d$, how many k-dimensional faces does a d-dimensional cube have?


2. What is the total number of faces of all dimensions? The d-dimensional face is the 
cube itself which you can include in your count. 


3. What is the surface area of a unit cube in d-dimensions (a unit cube has side-length 
one in each dimension)? 


4. What is the surface area of the cube if the length of each side was 2? 


5. Prove that the volume of a unit cube is close to its surface. 


Exercise 2.15 Consider the portion of the surface area of a unit radius, 3-dimensional 
ball with center at the origin that lies within a circular cone whose vertex is at the origin. 
What is the formula for the incremental unit of area when using polar coordinates to 
integrate the portion of the surface area of the ball that is lying inside the circular cone? 
What is the formula for the integral? What is the value of the integral if the angle of the 
cone is 36°? The angle of the cone is measured from the axis of the cone to a ray on the 
surface of the cone.


Exercise 2.16 Consider a unit radius, circular cylinder in 3-dimensions of height one.
The top of the cylinder could be a horizontal plane or half of a circular ball. Consider
these two possibilities for a unit radius, circular cylinder in 4-dimensions. In 4-dimensions
the horizontal plane is 3-dimensional and the half circular ball is 4-dimensional. In each
of the two cases, what is the surface area of the top face of the cylinder? You can use
V(d) for the volume of a unit radius, d-dimensional ball and A(d) for the surface area of
a unit radius, d-dimensional ball. An infinite length, unit radius, circular cylinder in 4-dimensions
would be the set $\{(x_1, x_2, x_3, x_4) \mid x_2^2 + x_3^2 + x_4^2 \le 1\}$ where the coordinate $x_1$ is
the axis.


Exercise 2.17 Given a d-dimensional circular cylinder of radius r and height h 
1. What is the surface area in terms of V(d) and A(d)? 


2. What is the volume?


Exercise 2.18 How does the volume of a ball of radius two behave as the dimension of 
the space increases? What if the radius was larger than two but a constant independent 
of d? What function of d would the radius need to be for a ball of radius r to have 
approximately constant volume as the dimension increases? Hint: you may want to use 
Stirling’s approximation, $n! \approx \left(\frac{n}{e}\right)^n$, for factorial.


Exercise 2.19 If $\lim_{d \to \infty} V(d) = 0$, the volume of a d-dimensional ball for sufficiently large
d must be less than V(3). How can this be if the d-dimensional ball contains the three
dimensional ball?


Exercise 2.20 


1. Write a recurrence relation for V(d) in terms of $V(d-1)$ by integrating over $x_1$.
Hint: At $x_1 = t$, the $(d-1)$-dimensional volume of the slice is the volume of a
$(d-1)$-dimensional sphere of radius $\sqrt{1 - t^2}$. Express this in terms of $V(d-1)$ and
write down the integral. You need not evaluate the integral.


2. Verify the formula for d = 2 and d = 3 by integrating and comparing with $V(2) = \pi$
and $V(3) = \frac{4}{3}\pi$.


Exercise 2.21 Consider a unit ball A centered at the origin and a unit ball B whose
center is at distance s from the origin. Suppose that a random point x is drawn from
the mixture distribution: “with probability 1/2, draw at random from A; with probability
1/2, draw at random from B”. Show that a separation $s \gg 1/\sqrt{d-1}$ is sufficient so that
$\text{Prob}(x \in A \cap B) = o(1)$; i.e., for any $\varepsilon > 0$ there exists c such that if $s \ge c/\sqrt{d-1}$, then
$\text{Prob}(x \in A \cap B) < \varepsilon$. In other words, this extent of separation means that nearly all of
the mixture distribution is identifiable.


Exercise 2.22 Consider the upper hemisphere of a unit-radius ball in d-dimensions.
What is the height of the maximum volume cylinder that can be placed entirely inside 
the hemisphere? As you increase the height of the cylinder, you need to reduce the cylin- 
der’s radius so that it will lie entirely within the hemisphere. 


Exercise 2.23 What is the volume of the maximum size d-dimensional hypercube that 
can be placed entirely inside a unit radius d-dimensional ball? 


Exercise 2.24 Calculate the ratio of area above the plane $x_1 = \varepsilon$ to the area of the upper
hemisphere of a unit radius ball in d-dimensions for $\varepsilon$ = 0.001, 0.01, 0.02, 0.03, 0.04, 0.05
and for d = 100 and d = 1,000. 


Exercise 2.25 Almost all of the volume of a ball in high dimensions lies in a narrow 
slice of the ball at the equator. However, the narrow slice is determined by the point on 
the surface of the ball that is designated the North Pole. Explain how this can be true
if several different locations are selected for the location of the North Pole giving rise to 
different equators. 


Exercise 2.26 Explain how the volume of a ball in high dimensions can simultaneously 
be in a narrow slice at the equator and also be concentrated in a narrow annulus at the 
surface of the ball. 


Exercise 2.27 Generate 500 points uniformly at random on the surface of a unit-radius 
ball in 50 dimensions. Then randomly generate five additional points. For each of the five 
new points, calculate a narrow band of width $\frac{2}{\sqrt{50}}$ at the equator, assuming the point was
the North Pole. How many of the 500 points are in each band corresponding to one of the 
five equators? How many of the points are in all five bands? How wide do the bands need 
to be for all points to be in all five bands? 


Exercise 2.28 Place 100 points at random on a d-dimensional unit-radius ball. Assume 
d is large. Pick a random vector and let it define two parallel hyperplanes on opposite 
sides of the origin that are equal distance from the origin. How close can the hyperplanes 
be moved and still have at least a .99 probability that all of the 100 points land between 
them? 


Exercise 2.29 Let x and y be d-dimensional zero mean, unit variance Gaussian vectors. 
Prove that x and y are almost orthogonal by considering their dot product. 


Exercise 2.30 Prove that with high probability, the angle between two random vectors in 
a high-dimensional space is at least 45°. Hint: use Theorem 2.8. 


Exercise 2.31 Project the volume of a d-dimensional ball of radius $\sqrt{d}$ onto a line
through the center. For large d, give an intuitive argument that the projected volume
should behave like a Gaussian.


Exercise 2.32


1. Write a computer program that generates n points uniformly distributed over the 
surface of a unit-radius d-dimensional ball. 


2. Generate 200 points on the surface of a sphere in 50 dimensions. 


3. Create several random lines through the origin and project the points onto each line. 
Plot the distribution of points on each line. 


4. What does your result from (3) say about the surface area of the sphere in relation 
to the lines, i.e., where is the surface area concentrated relative to each line? 


Exercise 2.33 If one generates points in d-dimensions with each coordinate a unit variance
Gaussian, the points will approximately lie on the surface of a sphere of radius $\sqrt{d}$.


1. What is the distribution when the points are projected onto a random line through 
the origin? 


2. If one uses a Gaussian with variance four, where in d-space will the points lie? 


Exercise 2.34 Randomly generate 100 points on the surface of a sphere in 3-dimensions
and in 100-dimensions. Create a histogram of all distances between the pairs of points in 
both cases. 


Exercise 2.35 We have claimed that a randomly generated point on a ball lies near the 
equator of the ball, independent of the point picked to be the North Pole. Is the same claim 
true for a randomly generated point on a cube? To test this claim, randomly generate ten 
±1 valued vectors in 128 dimensions. Think of these ten vectors as ten choices for the
North Pole. Then generate some additional ±1 valued vectors. To how many of the
original vectors is each of the new vectors close to being perpendicular; that is, how many 
of the equators is each new vector close to? 


Exercise 2.36 Define the equator of a d-dimensional unit cube to be the hyperplane
$\left\{ x \,\middle|\, \sum_{i=1}^{d} x_i = \frac{d}{2} \right\}$.


1. Are the vertices of a unit cube concentrated close to the equator? 





2. Is the volume of a unit cube concentrated close to the equator? 


3. Is the surface area of a unit cube concentrated close to the equator? 


Exercise 2.37 Consider a non-orthogonal basis $e_1, e_2, \ldots, e_d$. The $e_i$ are a set of linearly
independent unit vectors that span the space.


1. Prove that the representation of any vector in this basis is unique.

2. Calculate the squared length of $z = (\frac{\sqrt{2}}{2}, 1)_e$ where z is expressed in the basis
$e_1 = (1, 0)$ and $e_2 = (-\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2})$.

3. If $y = \sum_i a_i e_i$ and $z = \sum_i b_i e_i$, with $0 < a_i < b_i$, is it necessarily true that the
length of z is greater than the length of y? Why or why not?


4. Consider the basis $e_1 = (1, 0)$ and $e_2 = (-\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2})$.


(a) What is the representation of the vector (0, 1) in the basis $(e_1, e_2)$?

(b) What is the representation of the vector $(\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2})$?

(c) What is the representation of the vector (1, 2)?













Exercise 2.38 Generate 20 points uniformly at random on a 900-dimensional sphere of 
radius 30. Calculate the distance between each pair of points. Then, select a method of 
projection and project the data onto subspaces of dimension k = 100, 50, 10, 5, 4, 3, 2, 1
and calculate the difference between $\sqrt{k}$ times the original distances and the new pair-wise
distances. For each value of k what is the maximum difference as a percent of $\sqrt{k}$?


Exercise 2.39 What happens in high dimension to a lower dimensional manifold? To 
see what happens, consider a sphere of dimension 100 in a 1,000 dimensional space if the 
1,000 dimensional space is projected to a random 500 dimensional space. Will the sphere 
remain essentially spherical? Give an intuitive argument justifying your answer.


Exercise 2.40 In d dimensions, at most d unit vectors can be pairwise orthogonal.
However, if you wanted a set of vectors that were almost orthogonal you might
squeeze in a few more. For example, in 2-dimensions if almost orthogonal meant at least 
45 degrees apart, you could fit in three almost orthogonal vectors. Suppose you wanted to 
find 1000 almost orthogonal vectors in 100 dimensions. Here are two ways you could do 
it: 


1. Begin with 1,000 orthonormal 1,000-dimensional vectors, and then project them to 
a random 100-dimensional space. 


2. Generate 1000 100-dimensional random Gaussian vectors. 

Implement both ideas and compare them to see which does a better job.


Exercise 2.41 Suppose there is an object moving at constant velocity along a straight 
line. You receive the GPS coordinates corrupted by Gaussian noise every minute. How do
you estimate the current position? 


Exercise 2.42 


1. What is the maximum size rectangle that can be fitted under a unit variance Gaus- 
sian? 


2. What unit area rectangle best approximates a unit variance Gaussian if one measures
goodness of fit by the symmetric difference of the Gaussian and the rectangle?


Exercise 2.43 Let $x_1, x_2, \ldots, x_n$ be independent samples of a random variable x with
mean $\mu$ and variance $\sigma^2$. Let $m_s = \frac{1}{n}\sum_{i=1}^{n} x_i$ be the sample mean. Suppose one estimates
the variance using the sample mean rather than the true mean, that is,

$$\sigma_s^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - m_s)^2 .$$

Prove that $E(\sigma_s^2) = \frac{n-1}{n}\sigma^2$ and thus one should have divided by n − 1 rather than n.

Hint: First calculate the variance of the sample mean and show that $\mathrm{var}(m_s) = \frac{1}{n}\mathrm{var}(x)$.
Then calculate $E(\sigma_s^2) = E\left[\frac{1}{n}\sum_{i=1}^{n} (x_i - m_s)^2\right]$ by replacing $x_i - m_s$ with $(x_i - \mu) - (m_s - \mu)$.

Exercise 2.44 Generate ten values by a Gaussian probability distribution with zero mean 
and variance one. What is the center determined by averaging the points? What is the 
variance? In estimating the variance, use both the real center and the estimated center. 
When using the estimated center to estimate the variance, use both n = 10 and n = 9. 
How do the three estimates compare? 


Exercise 2.45 Suppose you want to estimate the unknown center of a Gaussian in d-space
which has variance one in each direction. Show that $O(\log d / \varepsilon^2)$ random samples
from the Gaussian are sufficient to get an estimate $m_s$ of the true center $\mu$, so that with
probability at least 99%,

$$\|\mu - m_s\|_\infty \le \varepsilon .$$

How many samples are sufficient to ensure that with probability at least 99%
$\|\mu - m_s\|_2 \le \varepsilon$?


Exercise 2.46 Use the probability distribution $\frac{1}{3\sqrt{2\pi}} e^{-\frac{1}{2}\frac{(x-5)^2}{9}}$ to generate ten points.

(a) From the ten points estimate $\mu$. How close is the estimate of $\mu$ to the true mean of 5?

(b) Using the true mean of 5, estimate $\sigma^2$ by the formula $\sigma^2 = \frac{1}{10}\sum_{i=1}^{10} (x_i - 5)^2$. How close
is the estimate of $\sigma^2$ to the true variance of 9?

(c) Using your estimate m of the mean, estimate $\sigma^2$ by the formula $\sigma^2 = \frac{1}{10}\sum_{i=1}^{10} (x_i - m)^2$.
How close is the estimate of $\sigma^2$ to the true variance of 9?

(d) Using your estimate m of the mean, estimate $\sigma^2$ by the formula $\sigma^2 = \frac{1}{9}\sum_{i=1}^{10} (x_i - m)^2$.
How close is the estimate of $\sigma^2$ to the true variance of 9?


Exercise 2.47 Create a list of the five most important things that you learned about high 
dimensions. 


Exercise 2.48 Write a short essay whose purpose is to excite a college freshman to learn 
about high dimensions. 


3 Best-Fit Subspaces and Singular Value Decomposition (SVD)


3.1 Introduction 


In this chapter, we examine the Singular Value Decomposition (SVD) of a matrix. 
Consider each row of an n x d matrix A as a point in d-dimensional space. The singular 
value decomposition finds the best-fitting k-dimensional subspace for k = 1,2,3,..., for 
the set of n data points. Here, “best” means minimizing the sum of the squares of the 
perpendicular distances of the points to the subspace, or equivalently, maximizing the 
sum of squares of the lengths of the projections of the points onto this subspace.⁴ We
begin with a special case where the subspace is 1-dimensional, namely a line through the
origin. We then show that the best-fitting k-dimensional subspace can be found by k
applications of the best fitting line algorithm, where on the i-th iteration we find the best
fit line perpendicular to the previous i − 1 lines. When k reaches the rank of the matrix,
from these operations we get an exact decomposition of the matrix called the singular 
value decomposition. 


In matrix notation, the singular value decomposition of a matrix A with real entries 
(we assume all our matrices have real entries) is the factorization of A into the product 
of three matrices, $A = UDV^T$, where the columns of U and V are orthonormal⁵ and the
matrix D is diagonal with positive real entries. The columns of V are the unit length
vectors defining the best fitting lines described above (the i-th column being the unit-length
vector in the direction of the i-th line). The coordinates of a row of U will be the fractions
of the corresponding row of A along the direction of each of the lines. 


The SVD is useful in many tasks. Often a data matrix A is close to a low rank ma- 
trix and it is useful to find a good low rank approximation to A. For any k, the singular 
value decomposition of A gives the best rank-k approximation to A in a well-defined sense. 


If $u_i$ and $v_i$ are columns of U and V respectively, then the matrix equation $A = UDV^T$
can be rewritten as

$$A = \sum_i d_{ii}\, u_i v_i^T .$$

Since $u_i$ is an n × 1 matrix and $v_i$ is a d × 1 matrix, $u_i v_i^T$ is an n × d matrix with the
same dimensions as A. The i-th term in the above sum can be viewed as giving the components
of the rows of A along direction $v_i$. When the terms are summed, they reconstruct A.
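
The rank-one sum can be illustrated on a tiny example. The following is a minimal Python sketch using only the standard library; the particular vectors, the diagonal entries 3 and 1, and the helper names are made up for illustration. It builds a 3 × 2 matrix once as a sum of terms $d_{ii} u_i v_i^T$ and once as the product $UDV^T$, and the two agree.

```python
import math

def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def mat_add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def mat_mul(X, Y):
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

s = 1 / math.sqrt(2)
# Hypothetical example data: n = 3, d = 2.
u_cols = [[1.0, 0.0, 0.0], [0.0, s, s]]   # orthonormal columns of U
v_cols = [[s, s], [s, -s]]                # orthonormal columns of V
d = [3.0, 1.0]                            # diagonal entries d_ii of D

# Sum of the rank-one terms d_ii * u_i v_i^T.
A_sum = [[0.0, 0.0] for _ in range(3)]
for d_ii, u, v in zip(d, u_cols, v_cols):
    term = [[d_ii * entry for entry in row] for row in outer(u, v)]
    A_sum = mat_add(A_sum, term)

# The same matrix written as the product U D V^T.
U = transpose(u_cols)                     # 3 x 2
D = [[d[0], 0.0], [0.0, d[1]]]            # 2 x 2
Vt = v_cols                               # the rows of V^T are the v_i
A_prod = mat_mul(mat_mul(U, D), Vt)

print(A_sum)
print(A_prod)
```

Each term has the same n × d shape as A, which is why the terms can be added entry by entry.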





⁴ This equivalence is due to the Pythagorean Theorem. For each point, its squared length (its distance
to the origin squared) is exactly equal to the squared length of its projection onto the subspace plus the
squared distance of the point to its projection; therefore, maximizing the sum of the former is equivalent
to minimizing the sum of the latter. For further discussion see Section 3.2.

⁵ A set of vectors is orthonormal if each is of length one and they are pairwise orthogonal.

This decomposition of A can be viewed as analogous to writing a vector x in some
orthonormal basis $v_1, v_2, \ldots, v_d$. The coordinates of x, $(x \cdot v_1, x \cdot v_2, \ldots, x \cdot v_d)$, are the
projections of x onto the $v_i$'s. For SVD, this basis has the property that for any k, the
first k vectors of this basis produce the least possible total sum of squares error for that 
value of k. 


In addition to the singular value decomposition, there is an eigenvalue decomposition.
Let A be a square matrix. A vector v such that $Av = \lambda v$ is called an eigenvector and
$\lambda$ the eigenvalue. When A is symmetric, the eigenvectors are orthogonal and A can be
expressed as $A = VDV^T$ where the eigenvectors are the columns of V and D is a diagonal
matrix with the corresponding eigenvalues on its diagonal. For a symmetric matrix
A the singular values are the absolute values of the eigenvalues. Some eigenvalues may
be negative but all singular values are positive by definition. If the singular values are
distinct, then A’s right singular vectors and eigenvectors are identical. The left singular
vectors of A are identical with the right singular vectors of A when the corresponding
eigenvalues are positive and are the negative of the right singular vectors when the
corresponding eigenvalues are negative. If a singular value has multiplicity d greater than one,
the corresponding singular vectors span a subspace of dimension d and any orthogonal
basis of the subspace can be used as the eigenvectors or singular vectors.⁶


The singular value decomposition is defined for all matrices, whereas the more fa- 
miliar eigenvector decomposition requires that the matrix A be square and certain other 
conditions on the matrix to ensure orthogonality of the eigenvectors. In contrast, the 
columns of V in the singular value decomposition, called the right-singular vectors of A, 
always form an orthogonal set with no assumptions on A. The columns of U are called 
the left-singular vectors and they also form an orthogonal set (see Section 3.6). A simple 
consequence of the orthonormality is that for a square and invertible matrix A, the inverse
of A is $V D^{-1} U^T$.
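
The inverse formula can be checked in one line. The sketch below assumes A is square and invertible, so that U and V are square orthogonal matrices ($U U^T = I$, $V^T V = I$) and every diagonal entry of D is positive:

```latex
\begin{aligned}
A \,\bigl(V D^{-1} U^T\bigr) &= \bigl(U D V^T\bigr)\bigl(V D^{-1} U^T\bigr) \\
                             &= U D \,(V^T V)\, D^{-1} U^T \\
                             &= U \,(D D^{-1})\, U^T = U U^T = I .
\end{aligned}
```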


Eigenvalues and eigenvectors satisfy $Av = \lambda v$. We will show that singular values and
vectors satisfy a somewhat analogous relationship. Since $Av_i$ is an n × 1 matrix (vector),
the matrix A cannot act on it from the left. But $A^T$, which is a d × n matrix, can act on
this vector. Indeed, we will show that

$$A v_i = d_{ii} u_i \quad \text{and} \quad A^T u_i = d_{ii} v_i .$$

In words, A acting on $v_i$ produces a scalar multiple of $u_i$ and $A^T$ acting on $u_i$ produces
the same scalar multiple of $v_i$. Note that $A^T A v_i = d_{ii}^2 v_i$. The i-th singular vector of A is
the i-th eigenvector of the square symmetric matrix $A^T A$.
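
The last identity follows by chaining the two relations, taking both as given (they are proved later in the chapter):

```latex
A^T A \, v_i \;=\; A^T (d_{ii}\, u_i) \;=\; d_{ii}\, A^T u_i \;=\; d_{ii}\,(d_{ii}\, v_i) \;=\; d_{ii}^2 \, v_i .
```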





⁶ When d = 1 there are actually two possible singular vectors, one the negative of the other. The
subspace spanned is unique.

Figure 3.1: The projection of the point $a_i$ onto the line through the origin in the direction
of v. Minimizing $\sum_i \mathrm{dist}_i^2$ is equivalent to maximizing $\sum_i \mathrm{proj}_i^2$.


3.2 Preliminaries 


Consider projecting a point $a_i = (a_{i1}, a_{i2}, \ldots, a_{id})$ onto a line through the origin. Then

$$a_{i1}^2 + a_{i2}^2 + \cdots + a_{id}^2 = (\text{length of projection})^2 + (\text{distance of point to line})^2 .$$

This holds by the Pythagorean Theorem (see Figure 3.1). Thus

$$(\text{distance of point to line})^2 = a_{i1}^2 + a_{i2}^2 + \cdots + a_{id}^2 - (\text{length of projection})^2 .$$

Since $\sum_{i=1}^{n} (a_{i1}^2 + a_{i2}^2 + \cdots + a_{id}^2)$ is a constant independent of the line, minimizing the sum
of the squares of the distances to the line is equivalent to maximizing the sum of the
squares of the lengths of the projections onto the line. Similarly for best-fit subspaces, 
maximizing the sum of the squared lengths of the projections onto the subspace minimizes 
the sum of squared distances to the subspace. 
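
The trade-off between the two sums can be checked numerically. Below is a minimal Python sketch, assuming three made-up points in the plane and a handful of arbitrary directions; the helper names are illustrative. For every direction, the sum of squared projections plus the sum of squared distances equals the same constant, the total squared length of the points.

```python
import math

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def split_along(a, v):
    """Return (projection length, distance to the line) of point a for a unit vector v."""
    proj_len = dot(a, v)
    residual = [ai - proj_len * vi for ai, vi in zip(a, v)]
    return abs(proj_len), math.sqrt(dot(residual, residual))

points = [[3.0, 1.0], [1.0, 2.0], [-2.0, 0.5]]   # illustrative data
total_sq_len = sum(dot(a, a) for a in points)    # independent of the line

def sums_for_angle(theta):
    """Sum of squared projections and sum of squared distances for the line at angle theta."""
    v = [math.cos(theta), math.sin(theta)]
    pairs = [split_along(a, v) for a in points]
    sum_proj_sq = sum(p * p for p, _ in pairs)
    sum_dist_sq = sum(dist * dist for _, dist in pairs)
    return sum_proj_sq, sum_dist_sq

for theta in (0.0, 0.4, 1.0, 2.5):
    sp, sd = sums_for_angle(theta)
    print(f"theta={theta:.1f}  proj^2={sp:.4f}  dist^2={sd:.4f}  sum={sp + sd:.4f}")
```

Because the last column never changes, any direction that raises the projection sum lowers the distance sum by the same amount.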


Thus we have two interpretations of the best-fit subspace. The first is that it minimizes 
the sum of squared distances of the data points to it. This first interpretation and its use 
are akin to the notion of least-squares fit from calculus.⁷ The second interpretation of
best-fit-subspace is that it maximizes the sum of projections squared of the data points 
on it. This says that the subspace contains the maximum content of data among all 
subspaces of the same dimension. The choice of the objective function as the sum of 
squared distances seems a bit arbitrary and in a way it is. But the square has many nice 
mathematical properties. The first of these, as we have just seen, is that minimizing the 
sum of squared distances is equivalent to maximizing the sum of squared projections. 





⁷ But there is a difference: here we take the perpendicular distance to the line or subspace, whereas,
in the calculus notion, given n pairs, $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, we find a line $l = \{(x, y) \mid y = mx + b\}$
minimizing the vertical squared distances of the points to it, namely, $\sum_{i=1}^{n} (y_i - m x_i - b)^2$.

3.3 Singular Vectors


We now define the singular vectors of an n x d matrix A. Consider the rows of A as 
n points in a d-dimensional space. Consider the best fit line through the origin. Let v 
be a unit vector along this line. The length of the projection of $a_i$, the i-th row of A, onto
v is $|a_i \cdot v|$. From this we see that the sum of the squared lengths of the projections is
$|Av|^2$. The best fit line is the one maximizing $|Av|^2$ and hence minimizing the sum of the
squared distances of the points to the line. 


With this in mind, define the first singular vector $v_1$ of A as

$$v_1 = \arg\max_{|v|=1} |Av| .$$

Technically, there may be a tie for the vector attaining the maximum and so we should
not use the article “the”; in fact, $-v_1$ is always as good as $v_1$. In this case, we arbitrarily
pick one of the vectors achieving the maximum and refer to it as “the first singular vector”
avoiding the more cumbersome “one of the vectors achieving the maximum”. We adopt
this terminology for all uses of arg max.


The value $\sigma_1(A) = |Av_1|$ is called the first singular value of A. Note that
$\sigma_1^2 = \sum_{i=1}^{n} (a_i \cdot v_1)^2$ is the sum of the squared lengths of the projections of the points onto the line
determined by $v_1$.
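
In the plane the arg max can be found by brute force, which makes the definition concrete. The sketch below is an illustration only, not the algorithm described later in the chapter: it scans unit vectors over half a circle for a small made-up data set lying near the line y = x. The function names and the step count are assumptions.

```python
import math

def sq_norm_Av(A, v):
    """|Av|^2, the sum of squared projection lengths of the rows of A onto v."""
    return sum(sum(a * b for a, b in zip(row, v)) ** 2 for row in A)

def first_singular_vector_2d(A, steps=20000):
    """Brute-force arg max of |Av| over unit vectors v in the plane."""
    best_v, best_val = None, -1.0
    for s in range(steps):
        theta = math.pi * s / steps          # v and -v tie, so half a circle suffices
        v = (math.cos(theta), math.sin(theta))
        val = sq_norm_Av(A, v)
        if val > best_val:
            best_v, best_val = v, val
    return best_v, math.sqrt(best_val)

# Rows are data points lying close to the line y = x (illustrative data).
A = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.0], [-1.0, -1.05]]
v1, sigma1 = first_singular_vector_2d(A)
print("v1 is approximately", v1)
print("sigma_1 is approximately", sigma1)
```

The vector found points almost exactly along (1, 1)/√2, the direction of the line the points cluster around, and the square of the value found is the sum of squared projections onto that direction.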


If the data points were all either on a line or close to a line, intuitively, $v_1$ should
give us the direction of that line. It is possible that data points are not close to one
line, but lie close to a 2-dimensional subspace or more generally a low dimensional space.
Suppose we have an algorithm for finding $v_1$ (we will describe one such algorithm later).
How do we use this to find the best-fit 2-dimensional plane or more generally the best fit
k-dimensional space?


The greedy approach begins by finding $v_1$ and then finds the best 2-dimensional
subspace containing $v_1$. The sum of squared distances helps. For every 2-dimensional
subspace containing $v_1$, the sum of squared lengths of the projections onto the subspace
equals the sum of squared projections onto $v_1$ plus the sum of squared projections along
a vector perpendicular to $v_1$ in the subspace. Thus, instead of looking for the best
2-dimensional subspace containing $v_1$, look for a unit vector $v_2$ perpendicular to $v_1$ that
maximizes $|Av|^2$ among all such unit vectors. Using the same greedy strategy to find the
best three and higher dimensional subspaces defines $v_3, v_4, \ldots$ in a similar manner. This
is captured in the following definitions. There is no a priori guarantee that the greedy
algorithm gives the best fit. But, in fact, the greedy algorithm does work and yields the
best-fit subspaces of every dimension as we will show.


The second singular vector, $v_2$, is defined by the best fit line perpendicular to $v_1$.

$$v_2 = \arg\max_{v \perp v_1,\ |v|=1} |Av| .$$

The value $\sigma_2(A) = |Av_2|$ is called the second singular value of A. The third singular
vector $v_3$ and the third singular value are defined similarly by

$$v_3 = \arg\max_{v \perp v_1, v_2,\ |v|=1} |Av|$$

and

$$\sigma_3(A) = |Av_3| ,$$

and so on. The process stops when we have found singular vectors $v_1, v_2, \ldots, v_r$, singular
values $\sigma_1, \sigma_2, \ldots, \sigma_r$, and

$$\max_{v \perp v_1, v_2, \ldots, v_r,\ |v|=1} |Av| = 0 .$$


The greedy algorithm found the $v_1$ that maximized $|Av|$ and then the best fit 2-dimensional
subspace containing $v_1$. Is this necessarily the best-fit 2-dimensional subspace
overall? The following theorem establishes that the greedy algorithm finds the best
subspaces of every dimension.


Theorem 3.1 (The Greedy Algorithm Works) Let A be an n × d matrix with singular
vectors $v_1, v_2, \ldots, v_r$. For $1 \le k \le r$, let $V_k$ be the subspace spanned by $v_1, v_2, \ldots, v_k$.
For each k, $V_k$ is the best-fit k-dimensional subspace for A.


Proof: The statement is obviously true for k = 1. For k = 2, let W be a best-fit
2-dimensional subspace for A. For any orthonormal basis $(w_1, w_2)$ of W, $|Aw_1|^2 + |Aw_2|^2$
is the sum of squared lengths of the projections of the rows of A onto W. Choose an
orthonormal basis $(w_1, w_2)$ of W so that $w_2$ is perpendicular to $v_1$. If $v_1$ is perpendicular
to W, any unit vector in W will do as $w_2$. If not, choose $w_2$ to be the unit vector in W
perpendicular to the projection of $v_1$ onto W. This makes $w_2$ perpendicular to $v_1$.⁸ Since
$v_1$ maximizes $|Av|^2$, it follows that $|Aw_1|^2 \le |Av_1|^2$. Since $v_2$ maximizes $|Av|^2$ over all
v perpendicular to $v_1$, $|Aw_2|^2 \le |Av_2|^2$. Thus

$$|Aw_1|^2 + |Aw_2|^2 \le |Av_1|^2 + |Av_2|^2 .$$

Hence, $V_2$ is at least as good as W and so is a best-fit 2-dimensional subspace.


For general k, proceed by induction. By the induction hypothesis, $V_{k-1}$ is a best-fit
(k − 1)-dimensional subspace. Suppose W is a best-fit k-dimensional subspace. Choose an





⁸ This can be seen by noting that $v_1$ is the sum of two vectors that each are individually perpendicular
to $w_2$, namely the projection of $v_1$ to W and the portion of $v_1$ orthogonal to W.

orthonormal basis $w_1, w_2, \ldots, w_k$ of W so that $w_k$ is perpendicular to $v_1, v_2, \ldots, v_{k-1}$.
Then

$$|Aw_1|^2 + |Aw_2|^2 + \cdots + |Aw_{k-1}|^2 \le |Av_1|^2 + |Av_2|^2 + \cdots + |Av_{k-1}|^2$$

since $V_{k-1}$ is an optimal (k − 1)-dimensional subspace. Since $w_k$ is perpendicular to
$v_1, v_2, \ldots, v_{k-1}$, by the definition of $v_k$, $|Aw_k|^2 \le |Av_k|^2$. Thus

$$|Aw_1|^2 + |Aw_2|^2 + \cdots + |Aw_{k-1}|^2 + |Aw_k|^2 \le |Av_1|^2 + |Av_2|^2 + \cdots + |Av_{k-1}|^2 + |Av_k|^2 ,$$

proving that $V_k$ is at least as good as W and hence is optimal. ■


Note that the n-dimensional vector $Av_i$ is a list of lengths (with signs) of the projections
of the rows of A onto $v_i$. Think of $|Av_i| = \sigma_i(A)$ as the component of the matrix
A along $v_i$. For this interpretation to make sense, it should be true that adding up the
squares of the components of A along each of the $v_i$ gives the square of the “whole content
of A”. This is indeed the case and is the matrix analogy of decomposing a vector into its 
components along orthogonal directions. 


Consider one row, say $a_j$, of A. Since $v_1, v_2, \ldots, v_r$ span the space of all rows of A,
$a_j \cdot v = 0$ for all v perpendicular to $v_1, v_2, \ldots, v_r$. Thus, for each row $a_j$,
$\sum_{i=1}^{r} (a_j \cdot v_i)^2 = |a_j|^2$. Summing over all rows j,

$$\sum_{j=1}^{n} |a_j|^2 = \sum_{j=1}^{n} \sum_{i=1}^{r} (a_j \cdot v_i)^2 = \sum_{i=1}^{r} \sum_{j=1}^{n} (a_j \cdot v_i)^2 = \sum_{i=1}^{r} |Av_i|^2 = \sum_{i=1}^{r} \sigma_i^2(A) .$$

But $\sum_{j=1}^{n} |a_j|^2 = \sum_{j=1}^{n} \sum_{k=1}^{d} a_{jk}^2$, the sum of squares of all the entries of A. Thus, the sum of
squares of the singular values of A is indeed the square of the “whole content of A”, i.e.,
the sum of squares of all the entries. There is an important norm associated with this
quantity, the Frobenius norm of A, denoted $\|A\|_F$, defined as

$$\|A\|_F = \sqrt{\sum_{j,k} a_{jk}^2} .$$
j,k 


Lemma 3.2 For any matrix A, the sum of squares of the singular values equals the square
of the Frobenius norm. That is, $\sum_i \sigma_i^2(A) = \|A\|_F^2$.


Proof: By the preceding discussion. ■
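
Lemma 3.2 can be checked by hand for a matrix with two columns. The sketch below is a minimal illustration on made-up data; it relies on the fact noted in the introduction that the squared singular values of A are the eigenvalues of $A^T A$, which for a 2 × 2 symmetric matrix come from the quadratic formula. The function names are assumptions.

```python
import math

def gram_2col(A):
    """Entries (p, q, r) of the 2 x 2 matrix A^T A = [[p, q], [q, r]]."""
    p = sum(row[0] * row[0] for row in A)
    q = sum(row[0] * row[1] for row in A)
    r = sum(row[1] * row[1] for row in A)
    return p, q, r

def singular_values_2col(A):
    """Singular values of an n x 2 matrix as square roots of the eigenvalues of A^T A."""
    p, q, r = gram_2col(A)
    mean = (p + r) / 2.0
    gap = math.sqrt(((p - r) / 2.0) ** 2 + q * q)
    return math.sqrt(mean + gap), math.sqrt(max(mean - gap, 0.0))

def frobenius_sq(A):
    """Sum of squares of all entries of A."""
    return sum(x * x for row in A for x in row)

A = [[3.0, 1.0], [1.0, 2.0], [-2.0, 0.5]]   # illustrative data
s1, s2 = singular_values_2col(A)
print("sigma_1^2 + sigma_2^2 =", s1 ** 2 + s2 ** 2)
print("||A||_F^2            =", frobenius_sq(A))
```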

The vectors $v_1, v_2, \ldots, v_r$ are called the right-singular vectors. The vectors $Av_i$ form
a fundamental set of vectors and we normalize them to length one by

$$u_i = \frac{1}{\sigma_i(A)} A v_i .$$

Later we will show that $u_i$ similarly maximizes $|u^T A|$ over all u perpendicular to $u_1, \ldots, u_{i-1}$.
These $u_i$ are called the left-singular vectors. Clearly, the right-singular vectors are orthogonal
by definition. We will show later that the left-singular vectors are also orthogonal.

3.4 Singular Value Decomposition (SVD)


Let A be an n × d matrix with singular vectors $v_1, v_2, \ldots, v_r$ and corresponding
singular values $\sigma_1, \sigma_2, \ldots, \sigma_r$. The left-singular vectors of A are $u_i = \frac{1}{\sigma_i} A v_i$ where $\sigma_i u_i$ is
a vector whose coordinates correspond to the projections of the rows of A onto $v_i$. Each
$\sigma_i u_i v_i^T$ is a rank one matrix whose rows are the “$v_i$ components” of the rows of A, i.e., the
projections of the rows of A in the $v_i$ direction. We will prove that A can be decomposed
into a sum of rank one matrices as

$$A = \sum_{i=1}^{r} \sigma_i u_i v_i^T .$$

Geometrically, each point is decomposed in A into its components along each of the r
orthogonal directions given by the $v_i$. We will also prove this algebraically. We begin
with a simple lemma that two matrices A and B are identical if $Av = Bv$ for all v.


Lemma 3.3 Matrices A and B are identical if and only if for all vectors v, Av = Bv. 


Proof: Clearly, if A = B then Av = Bv for all v. For the converse, suppose that
Av = Bv for all v. Let $e_i$ be the vector that is all zeros except for the i-th component
which has value one. Now $Ae_i$ is the i-th column of A and thus A = B if for each i,
$Ae_i = Be_i$. ■


Theorem 3.4 Let A be an n × d matrix with right-singular vectors $v_1, v_2, \ldots, v_r$, left-singular
vectors $u_1, u_2, \ldots, u_r$, and corresponding singular values $\sigma_1, \sigma_2, \ldots, \sigma_r$. Then

$$A = \sum_{i=1}^{r} \sigma_i u_i v_i^T .$$

Proof: We first show that multiplying both A and $\sum_{i=1}^{r} \sigma_i u_i v_i^T$ by $v_j$ results in equality.

$$\sum_{i=1}^{r} \sigma_i u_i v_i^T v_j = \sigma_j u_j = A v_j$$

Since any vector v can be expressed as a linear combination of the singular vectors
plus a vector perpendicular to the $v_i$, $Av = \sum_{i=1}^{r} \sigma_i u_i v_i^T v$ for all v and by Lemma 3.3,
$A = \sum_{i=1}^{r} \sigma_i u_i v_i^T$. ■

The decomposition $A = \sum_i \sigma_i u_i v_i^T$ is called the singular value decomposition, SVD,
of A. We can rewrite this equation in matrix notation as $A = UDV^T$ where $u_i$ is the i-th
column of U, $v_i^T$ is the i-th row of $V^T$, and D is a diagonal matrix with $\sigma_i$ as the i-th entry
on its diagonal.

Figure 3.2: The SVD decomposition of an n × d matrix.

For any matrix A, the sequence of singular values is unique and if the
singular values are all distinct, then the sequence of singular vectors is unique up to signs. 
However, when some set of singular values are equal, the corresponding singular vectors 
span some subspace. Any set of orthonormal vectors spanning this subspace can be used 
as the singular vectors. 


3.5 Best Rank-k Approximations 


Let A be an n × d matrix and think of the rows of A as n points in d-dimensional
space. Let

$$A = \sum_{i=1}^{r} \sigma_i u_i v_i^T$$

be the SVD of A. For $k \in \{1, 2, \ldots, r\}$, let

$$A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T$$

be the sum truncated after k terms. It is clear that $A_k$ has rank k. We show that $A_k$
is the best rank k approximation to A, where error is measured in the Frobenius norm.
Geometrically, this says that $v_1, \ldots, v_k$ define the k-dimensional space minimizing the
sum of squared distances of the points to the space. To see why, we need the following
lemma.


Lemma 3.5 The rows of $A_k$ are the projections of the rows of A onto the subspace $V_k$
spanned by the first k singular vectors of A.


Proof: Let a be an arbitrary row vector. Since the $v_i$ are orthonormal, the projection
of the vector a onto $V_k$ is given by $\sum_{i=1}^{k} (a \cdot v_i) v_i^T$. Thus, the matrix whose rows are
the projections of the rows of A onto $V_k$ is given by $\sum_{i=1}^{k} A v_i v_i^T$. This last expression
simplifies to

$$\sum_{i=1}^{k} A v_i v_i^T = \sum_{i=1}^{k} \sigma_i u_i v_i^T = A_k .$$


Theorem 3.6 For any matrix B of rank at most k,

$$\|A - A_k\|_F \le \|A - B\|_F .$$

Proof: Let B minimize $\|A - B\|_F^2$ among all rank k or less matrices. Let V be the space
spanned by the rows of B. The dimension of V is at most k. Since B minimizes $\|A - B\|_F^2$,
it must be that each row of B is the projection of the corresponding row of A onto V:
Otherwise replace the row of B with the projection of the corresponding row of A onto
V. This still keeps the row space of B contained in V and hence the rank of B is still at
most k. But it reduces $\|A - B\|_F^2$, contradicting the minimality of $\|A - B\|_F$.


Since each row of B is the projection of the corresponding row of A, it follows that
$\|A - B\|_F^2$ is the sum of squared distances of rows of A to V. Since $A_k$ minimizes the
sum of squared distance of rows of A to any k-dimensional subspace, from Theorem 3.1,
it follows that $\|A - A_k\|_F \le \|A - B\|_F$. ■
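
A small numerical illustration of Theorem 3.6 for k = 1, on made-up data with two columns. In the plane the maximizing direction has a closed form, so no iterative method is needed. The sketch compares $A_1$ only against rank-one matrices whose rows are projections of the rows of A onto some other line, which by the proof above is the only kind of competitor that matters. All names are illustrative.

```python
import math

def top_direction_2col(A):
    """Unit vector maximizing |Av|^2 for an n x 2 matrix (closed form in the plane)."""
    p = sum(r[0] * r[0] for r in A)
    q = sum(r[0] * r[1] for r in A)
    s = sum(r[1] * r[1] for r in A)
    theta = 0.5 * math.atan2(2.0 * q, p - s)
    return (math.cos(theta), math.sin(theta))

def project_rows(A, v):
    """Matrix whose rows are the projections of the rows of A onto the line along unit v."""
    return [[(r[0] * v[0] + r[1] * v[1]) * vj for vj in v] for r in A]

def frob_error_sq(A, B):
    """Squared Frobenius norm of A - B."""
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))

A = [[3.0, 1.0], [1.0, 2.0], [-2.0, 0.5]]           # illustrative data
v1 = top_direction_2col(A)
A1 = project_rows(A, v1)                            # rows of A_1 (Lemma 3.5)
best_err = frob_error_sq(A, A1)
sigma1_sq = sum((r[0] * v1[0] + r[1] * v1[1]) ** 2 for r in A)
frob_sq = sum(x * x for r in A for x in r)

# Rank-one competitors: projections onto lines at every whole degree.
other_errs = []
for deg in range(180):
    ang = math.radians(deg)
    w = (math.cos(ang), math.sin(ang))
    other_errs.append(frob_error_sq(A, project_rows(A, w)))

print("error of A_1:", best_err)
print("smallest competitor error:", min(other_errs))
```

The error of $A_1$ also equals $\|A\|_F^2 - \sigma_1^2$, in line with Lemma 3.2.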


In addition to the Frobenius norm, there is another matrix norm of interest. Consider 
an n x d matrix A and a large number of vectors where for each vector x we wish to 
compute Ax. It takes time O(nd) to compute each product Ax but if we approximate 
A by $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T$ and approximate Ax by $A_k x$ it requires only k dot products
of d-dimensional vectors, followed by a sum of k n-dimensional vectors, and takes time
O(kd + kn), which is a win provided $k \ll \min(d, n)$. How is the error measured? Since x
is unknown, the approximation needs to be good for every x. So we take the maximum
over all x of $|(A_k - A)x|$. Since this would be infinite if |x| could grow without bound,
we restrict the maximum to $|x| \le 1$. Formally, we define a new norm of a matrix A by

$$\|A\|_2 = \max_{|x| \le 1} |Ax| .$$

This is called the 2-norm or the spectral norm. Note that it equals $\sigma_1(A)$.
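
The O(kd + kn) computation can be sketched as follows. This is an illustration under stated assumptions: the singular triples are hand-picked so that the SVD is known exactly, the matrix is tiny, and the function names are made up. The point is that $A_k x$ is formed from k dot products and k scaled vectors without ever building $A_k$.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def apply_low_rank(sigmas, us, vs, x):
    """Compute A_k x = sum_i sigma_i (v_i . x) u_i without forming A_k."""
    n = len(us[0])
    y = [0.0] * n
    for sigma, u, v in zip(sigmas, us, vs):
        coeff = sigma * dot(v, x)      # one d-dimensional dot product: O(d)
        for j in range(n):             # add one scaled n-dimensional vector: O(n)
            y[j] += coeff * u[j]
    return y

def apply_dense(A, x):
    """Ordinary matrix-vector product, O(nd)."""
    return [dot(row, x) for row in A]

s = 1 / math.sqrt(2)
# Hypothetical matrix given by its singular triples (n = 3, d = 3).
sigmas = [5.0, 2.0, 0.1]
us = [[1.0, 0.0, 0.0], [0.0, s, s], [0.0, s, -s]]
vs = [[s, s, 0.0], [s, -s, 0.0], [0.0, 0.0, 1.0]]
A = [[sum(sg * u[i] * v[j] for sg, u, v in zip(sigmas, us, vs)) for j in range(3)]
     for i in range(3)]

x = [0.6, 0.0, 0.8]                    # a unit-length query vector
exact = apply_dense(A, x)
approx = apply_low_rank(sigmas[:2], us[:2], vs[:2], x)   # k = 2
err = math.sqrt(sum((e - a) ** 2 for e, a in zip(exact, approx)))
print("error |(A - A_2) x| =", err)
```

For this unit-length x the error does not exceed the singular value that was dropped (0.1 here), which is the kind of guarantee the 2-norm is designed to express.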


As an application consider a large database of documents that form rows of an n x d 
matrix A. There are d terms and each document is a d-dimensional vector with one 
component for each term, which is the number of occurrences of the term in the document. 
We are allowed to “preprocess” A. After the preprocessing, we receive queries. Each 
query x is a d-dimensional vector which specifies how important each term is to the
query. The desired answer is an n-dimensional vector which gives the similarity (dot
product) of the query to each document in the database, namely Ax, the “matrix-vector”
product. Query time is to be much less than preprocessing time, since the idea is that we
need to answer many queries for the same database. There are many other applications 
where one performs many matrix vector products with the same matrix. This technique 
is applicable to these situations as well. 


3.6 Left Singular Vectors 


The left singular vectors are also pairwise orthogonal. Intuitively if $u_i$ and $u_j$, i < j, were
not orthogonal, one would suspect that the right singular vector $v_j$ had a component of $v_i$
which would contradict that $v_i$ and $v_j$ were orthogonal. Let i be the smallest integer such
that $u_i$ is not orthogonal to all other $u_j$. Then to prove that $u_i$ and $u_j$ are orthogonal,
we add a small component of $v_j$ to $v_i$, normalize the result to be a unit vector

$$v_i' = \frac{v_i + \varepsilon v_j}{|v_i + \varepsilon v_j|}$$

and show that $|Av_i'| > |Av_i|$, a contradiction.

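Theorem 3.7 The left singular vectors are pairwise orthogonal.


Proof: Let i be the smallest integer such that $u_i$ is not orthogonal to some other $u_j$.
Without loss of generality assume that $u_i^T u_j = \delta > 0$. If $u_i^T u_j < 0$ then just replace $u_i$
with $-u_i$. Clearly j > i since i was selected to be the smallest such index. For $\varepsilon > 0$, let

$$v_i' = \frac{v_i + \varepsilon v_j}{|v_i + \varepsilon v_j|} .$$

Notice that $v_i'$ is a unit-length vector.

$$A v_i' = \frac{\sigma_i u_i + \varepsilon \sigma_j u_j}{\sqrt{1 + \varepsilon^2}}$$

has length at least as large as its component along $u_i$ which is

$$u_i^T \left( \frac{\sigma_i u_i + \varepsilon \sigma_j u_j}{\sqrt{1 + \varepsilon^2}} \right) > (\sigma_i + \varepsilon \sigma_j \delta)\left(1 - \frac{\varepsilon^2}{2}\right) = \sigma_i - \frac{\varepsilon^2}{2}\sigma_i + \varepsilon \sigma_j \delta - \frac{\varepsilon^3}{2}\sigma_j \delta > \sigma_i ,$$

for sufficiently small $\varepsilon$, a contradiction since $v_i + \varepsilon v_j$ is orthogonal to $v_1, v_2, \ldots, v_{i-1}$ since
j > i and $\sigma_i$ is defined to be the maximum of $|Av|$ over such vectors. ■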



Next we prove that $A_k$ is the best rank k, 2-norm approximation to A. We first show
that the square of the 2-norm of $A - A_k$ is the square of the (k + 1)-st singular value of A.
This is essentially by definition of $A_k$; that is, $A_k$ represents the projections of the rows in
A onto the space spanned by the top k singular vectors, and so $A - A_k$ is the remaining
portion of those rows, whose top singular value will be $\sigma_{k+1}$.


Lemma 3.8 $\|A - A_k\|_2^2 = \sigma_{k+1}^2$.

Proof: Let $A = \sum_{i=1}^{r} \sigma_i u_i v_i^T$ be the singular value decomposition of A. Then $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T$
and $A - A_k = \sum_{i=k+1}^{r} \sigma_i u_i v_i^T$. Let v be the top singular vector of $A - A_k$. Express v as a
linear combination of $v_1, v_2, \ldots, v_r$. That is, write $v = \sum_{j=1}^{r} c_j v_j$. Then

$$|(A - A_k)v| = \Big|\sum_{i=k+1}^{r} \sigma_i u_i v_i^T \sum_{j=1}^{r} c_j v_j\Big| = \Big|\sum_{i=k+1}^{r} c_i \sigma_i u_i v_i^T v_i\Big| = \Big|\sum_{i=k+1}^{r} c_i \sigma_i u_i\Big| = \sqrt{\sum_{i=k+1}^{r} c_i^2 \sigma_i^2} ,$$

since the $u_i$ are orthonormal. The v maximizing this last quantity, subject to the constraint
that $|v|^2 = \sum_{i=1}^{r} c_i^2 = 1$, occurs when $c_{k+1} = 1$ and the rest of the $c_i$ are zero. Thus,
$\|A - A_k\|_2^2 = \sigma_{k+1}^2$, proving the lemma. ■

Finally, we prove that $A_k$ is the best rank k, 2-norm approximation to A:

Theorem 3.9 Let A be an n × d matrix. For any matrix B of rank at most k,

$$\|A - A_k\|_2 \le \|A - B\|_2 .$$

Proof: If A is of rank k or less, the theorem is obviously true since $\|A - A_k\|_2 = 0$.
Assume that A is of rank greater than k. By Lemma 3.8, $\|A - A_k\|_2^2 = \sigma_{k+1}^2$. The null
space of B, the set of vectors v such that Bv = 0, has dimension at least d − k. Let
$v_1, v_2, \ldots, v_{k+1}$ be the first k + 1 singular vectors of A. By a dimension argument, it
follows that there exists a $z \ne 0$ in

$$\mathrm{Null}(B) \cap \mathrm{Span}\{v_1, v_2, \ldots, v_{k+1}\} .$$

Scale z to be of length one.

$$\|A - B\|_2^2 \ge |(A - B)z|^2 .$$

Since Bz = 0,

$$\|A - B\|_2^2 \ge |Az|^2 .$$

Since z is in the $\mathrm{Span}\{v_1, v_2, \ldots, v_{k+1}\}$,

$$|Az|^2 = \Big|\sum_{i=1}^{r} \sigma_i u_i v_i^T z\Big|^2 = \sum_{i=1}^{r} \sigma_i^2 (v_i^T z)^2 = \sum_{i=1}^{k+1} \sigma_i^2 (v_i^T z)^2 \ge \sigma_{k+1}^2 \sum_{i=1}^{k+1} (v_i^T z)^2 = \sigma_{k+1}^2 .$$

It follows that $\|A - B\|_2^2 \ge \sigma_{k+1}^2$, proving the theorem. ■

For a square symmetric matrix A and eigenvector v, $Av = \lambda v$. We now prove the
analog for singular values and vectors we discussed in the introduction.


Lemma 3.10 (Analog of eigenvalues and eigenvectors)

$$A v_i = \sigma_i u_i \quad \text{and} \quad A^T u_i = \sigma_i v_i .$$

Proof: The first equation follows from the definition of left singular vectors. For the
second, note that from the SVD, we get $A^T u_i = \sum_j \sigma_j v_j u_j^T u_i$, where since the $u_j$ are
orthonormal, all terms in the summation are zero except for j = i. ■


3.7 Power Method for Singular Value Decomposition 


Computing the singular value decomposition is an important branch of numerical 
analysis in which there have been many sophisticated developments over a long period of 
time. The reader is referred to numerical analysis texts for more details. Here we present 
an “in-principle” method to establish that the approximate SVD of a matrix A can be 
computed in polynomial time. The method we present, called the power method, is simple 
and is in fact the conceptual starting point for many algorithms. Let A be a matrix whose 
SVD is $\sum_i \sigma_i u_i v_i^T$. We wish to work with a matrix that is square and symmetric. Let
$B = A^T A$. By direct multiplication, using the orthogonality of the $u_i$’s that was proved
in Theorem 3.7,

$$B = A^T A = \Big(\sum_i \sigma_i v_i u_i^T\Big)\Big(\sum_j \sigma_j u_j v_j^T\Big) = \sum_{i,j} \sigma_i \sigma_j v_i (u_i^T \cdot u_j) v_j^T = \sum_i \sigma_i^2 v_i v_i^T .$$

The matrix B is square and symmetric, and has the same left and right-singular vectors.
In particular, $Bv_j = (\sum_i \sigma_i^2 v_i v_i^T) v_j = \sigma_j^2 v_j$, so $v_j$ is an eigenvector of B with eigenvalue
$\sigma_j^2$. If A is itself square and symmetric, it will have the same right and left-singular
vectors, namely $A = \sum_i \sigma_i v_i v_i^T$ and computing B is unnecessary.


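Now consider computing $B^2$.

$$B^2 = \Big(\sum_i \sigma_i^2 v_i v_i^T\Big)\Big(\sum_j \sigma_j^2 v_j v_j^T\Big) = \sum_{i,j} \sigma_i^2 \sigma_j^2 v_i (v_i^T v_j) v_j^T .$$

When i ≠ j, the dot product $v_i^T v_j$ is zero by orthogonality.⁹ Thus, $B^2 = \sum_{i=1}^{r} \sigma_i^4 v_i v_i^T$. In
computing the k-th power of B, all the cross product terms are zero and

$$B^k = \sum_{i=1}^{r} \sigma_i^{2k} v_i v_i^T .$$

⁹ The “outer product” $v_i v_j^T$ is a matrix and is not zero even for i ≠ j.
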

If $\sigma_1 > \sigma_2$, then the first term in the summation dominates, so $B^k \to \sigma_1^{2k} v_1 v_1^T$. This
means a close estimate to $v_1$ can be computed by simply taking the first column of $B^k$
and normalizing it to a unit vector.
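
A minimal sketch of this first version of the power method, on a small made-up matrix. It forms $B = A^T A$, raises it to the power $k = 2^6$ by repeated squaring, and normalizes the first column. Two details are assumptions of the sketch rather than part of the text: the matrix is rescaled after each squaring to keep the numbers bounded (which does not change any direction), and the first coordinate of $v_1$ is assumed to be non-zero so that the first column is usable.

```python
import math

def mat_mul(X, Y):
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def gram(A):
    """B = A^T A."""
    At = [list(c) for c in zip(*A)]
    return mat_mul(At, A)

def first_column_estimate(A, squarings=6):
    """Estimate v_1 from the first column of B^k with k = 2**squarings."""
    P = gram(A)
    for _ in range(squarings):
        P = mat_mul(P, P)                               # B^(2m) from B^m
        scale = max(abs(x) for row in P for x in row)
        P = [[x / scale for x in row] for row in P]     # rescale; direction unchanged
    col = [row[0] for row in P]
    norm = math.sqrt(sum(c * c for c in col))
    return [c / norm for c in col]

A = [[3.0, 1.0], [1.0, 2.0], [-2.0, 0.5]]               # illustrative data
v1_est = first_column_estimate(A)
print("estimate of v_1:", v1_est)
```

Every squaring is a full matrix product, which is the cost the faster method below avoids.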


3.7.1 A Faster Method 


A problem with the above method is that A may be a very large, sparse matrix, say a
$10^8 \times 10^8$ matrix with $10^9$ non-zero entries. Sparse matrices are often represented by just
a list of non-zero entries, say a list of triples of the form $(i, j, a_{ij})$. Though A is sparse, B
need not be and in the worst case may have all $10^{16}$ entries non-zero¹⁰ and it is then
impossible to even write down B, let alone compute the product $B^2$. Even if A is moderate in
size, computing matrix products is costly in time. Thus, a more efficient method is needed.


Instead of computing $B^k$, select a random vector x and compute the product $B^k x$.
The vector x can be expressed in terms of the singular vectors of B augmented to a full
orthonormal basis as $x = \sum_{i=1}^d c_i v_i$. Then

$$B^k x \approx \left(\sigma_1^{2k} v_1 v_1^T\right)\left(\sum_{i=1}^d c_i v_i\right) = \sigma_1^{2k} c_1 v_1.$$

Normalizing the resulting vector yields $v_1$, the first singular vector of A. The way $B^k x$
is computed is by a series of matrix vector products, instead of matrix products:
$B^k x = A^T A \cdots A^T A x$, which can be computed right-to-left. This consists of 2k vector times
sparse matrix multiplications.
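
The right-to-left computation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the fixed seed, and the tiny example matrix are hypothetical, and the vector is renormalized after every application of $A^T A$ (a common safeguard against overflow that the text does not prescribe).

```python
import math
import random

def sparse_matvec(triples, x, n_rows):
    """Return A x, where A is stored as a list of (i, j, a_ij) triples."""
    y = [0.0] * n_rows
    for i, j, a in triples:
        y[i] += a * x[j]
    return y

def sparse_rmatvec(triples, y, n_cols):
    """Return A^T y for the same triple representation."""
    x = [0.0] * n_cols
    for i, j, a in triples:
        x[j] += a * y[i]
    return x

def normalize(x):
    norm = math.sqrt(sum(t * t for t in x))
    return [t / norm for t in x]

def power_method(triples, n_rows, n_cols, k, seed=0):
    """Approximate the first right singular vector via (A^T A)^k x,
    using 2k sparse matrix-vector products and never forming A^T A."""
    rng = random.Random(seed)
    x = normalize([rng.gauss(0.0, 1.0) for _ in range(n_cols)])
    for _ in range(k):
        ax = sparse_matvec(triples, x, n_rows)               # A x
        x = normalize(sparse_rmatvec(triples, ax, n_cols))   # A^T (A x)
    return x

# Hypothetical 3 x 2 example: A has entries a_00 = 3 and a_11 = 1,
# so sigma_1 = 3 and v_1 = (1, 0) up to sign.
A = [(0, 0, 3.0), (1, 1, 1.0)]
v1 = power_method(A, n_rows=3, n_cols=2, k=30)
```

Each iteration touches every stored triple twice, so the cost per iteration is proportional to the number of non-zero entries rather than to the size of B.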


To compute k singular vectors, one selects a random vector r and finds an orthonormal 
basis for the space spanned by $r, Ar, \ldots, A^{k-1}r$. Then compute A times each of the basis
vectors, and find an orthonormal basis for the space spanned by the resulting vectors. 
Intuitively, one has applied A to a subspace rather than a single vector. One repeat- 
edly applies A to the subspace, calculating an orthonormal basis after each application 
to prevent the subspace collapsing to the one dimensional subspace spanned by the first 
singular vector. The process quickly converges to the first k singular vectors. 
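
A minimal sketch of this subspace iteration follows. It assumes, for simplicity, that the matrix being applied is a small dense symmetric matrix B (for instance $B = A^T A$) stored as a list of rows, and it uses Gram-Schmidt for the orthonormalization step; the helper names and the example matrix are hypothetical.

```python
import math
import random

def matvec(B, x):
    return [sum(b * t for b, t in zip(row, x)) for row in B]

def orthonormalize(vectors):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            coeff = sum(a * b for a, b in zip(q, w))
            w = [a - coeff * b for a, b in zip(w, q)]
        norm = math.sqrt(sum(t * t for t in w))
        basis.append([t / norm for t in w])
    return basis

def subspace_iteration(B, k, iterations, seed=0):
    rng = random.Random(seed)
    r = [rng.gauss(0.0, 1.0) for _ in B]
    # Start from an orthonormal basis of span{r, Br, ..., B^(k-1) r}.
    start = [r]
    for _ in range(k - 1):
        start.append(matvec(B, start[-1]))
    basis = orthonormalize(start)
    # Repeatedly apply B to every basis vector, then re-orthonormalize so the
    # subspace does not collapse onto the first singular vector.
    for _ in range(iterations):
        basis = orthonormalize([matvec(B, q) for q in basis])
    return basis

# Hypothetical example: B = diag(9, 4, 1); the top two directions are e_1 and e_2.
B = [[9.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 1.0]]
basis = subspace_iteration(B, k=2, iterations=60)
```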


An issue occurs if there is no significant gap between the first and second singular
values of a matrix. Take for example the case when there is a tie for the first singular
vector and $\sigma_1 = \sigma_2$. Then, the above argument fails. We will overcome this hurdle.
Theorem 3.11 below states that even with ties, the power method converges to some
vector in the span of those singular vectors corresponding to the “nearly highest” singular
values. The theorem assumes it is given a vector x which has a component of magnitude
at least $\delta$ along the first right singular vector $v_1$ of A. We will see in Lemma 3.12 that a
random vector satisfies this condition with fairly high probability.





10 E.g., suppose each entry in the first row of A is non-zero and the rest of A is zero.
Theorem 3.11 Let A be an $n \times d$ matrix and x a unit length vector in $R^d$ with $|x^T v_1| \ge \delta$,
where $\delta > 0$. Let V be the space spanned by the right singular vectors of A corresponding
to singular values greater than $(1 - \varepsilon)\sigma_1$. Let w be the unit vector after $k = \frac{\ln(1/\varepsilon\delta)}{2\varepsilon}$
iterations of the power method, namely,

$$w = \frac{(A^T A)^k x}{\left|(A^T A)^k x\right|}.$$

Then w has a component of at most $\varepsilon$ perpendicular to V.

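Proof: Let

$$A = \sum_{i=1}^r \sigma_i u_i v_i^T$$

be the SVD of A. If the rank of A is less than d, then for convenience complete
$\{v_1, v_2, \ldots, v_r\}$ into an orthonormal basis $\{v_1, v_2, \ldots, v_d\}$ of d-space. Write x in the basis
of the $v_i$'s as

$$x = \sum_{i=1}^d c_i v_i.$$

Since $(A^T A)^k = \sum_{i=1}^d \sigma_i^{2k} v_i v_i^T$, it follows that $(A^T A)^k x = \sum_{i=1}^d \sigma_i^{2k} c_i v_i$. By hypothesis,
$|c_1| \ge \delta$.

Suppose that $\sigma_1, \sigma_2, \ldots, \sigma_m$ are the singular values of A that are greater than or equal
to $(1 - \varepsilon)\sigma_1$ and that $\sigma_{m+1}, \ldots, \sigma_d$ are the singular values that are less than $(1 - \varepsilon)\sigma_1$.
Now

$$\left|(A^T A)^k x\right|^2 = \left|\sum_{i=1}^d \sigma_i^{2k} c_i v_i\right|^2 = \sum_{i=1}^d \sigma_i^{4k} c_i^2 \ge \sigma_1^{4k} c_1^2 \ge \sigma_1^{4k} \delta^2.$$

The squared length of the component of $(A^T A)^k x$ perpendicular to the space V is

$$\sum_{i=m+1}^d \sigma_i^{4k} c_i^2 \le (1 - \varepsilon)^{4k} \sigma_1^{4k} \sum_{i=m+1}^d c_i^2 \le (1 - \varepsilon)^{4k} \sigma_1^{4k},$$

since $\sum_{i=1}^d c_i^2 = |x|^2 = 1$. Thus, the component of w perpendicular to V has squared
length at most $\frac{(1-\varepsilon)^{4k}\sigma_1^{4k}}{\sigma_1^{4k}\delta^2}$ and so its length is at most

$$\frac{(1 - \varepsilon)^{2k} \sigma_1^{2k}}{\delta \sigma_1^{2k}} = \frac{(1 - \varepsilon)^{2k}}{\delta} \le \frac{e^{-2k\varepsilon}}{\delta} = \varepsilon,$$

since $k = \frac{\ln(1/\varepsilon\delta)}{2\varepsilon}$. ∎
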
Lemma 3.12 Let $y \in R^d$ be a random vector with the unit variance spherical Gaussian
as its probability density. Normalize y to be a unit length vector by setting $x = y/|y|$. Let
v be any unit length vector. Then

$$\mathrm{Prob}\left(|x^T v| \le \frac{1}{20\sqrt{d}}\right) \le \frac{1}{10} + 3e^{-d/96}.$$

Proof: If $|y| \le 2\sqrt{d}$ and $|y^T v| > \frac{1}{10}$, then $|x^T v| = |y^T v|/|y| > \frac{1}{20\sqrt{d}}$. So to prove the
bound for the unit length vector x, it suffices to prove for the unnormalized vector y that
$\mathrm{Prob}(|y| \ge 2\sqrt{d}) \le 3e^{-d/96}$ and $\mathrm{Prob}(|y^T v| \le \frac{1}{10}) \le \frac{1}{10}$. That $\mathrm{Prob}(|y| \ge 2\sqrt{d})$ is at most
$3e^{-d/96}$ follows from Theorem (2.9) with $\sqrt{d}$ substituted for $\beta$. That the probability of
$|y^T v| \le \frac{1}{10}$ is at most 1/10 follows from the fact that $y^T v$ is a random, zero mean, unit
variance Gaussian whose density is at most $1/\sqrt{2\pi} \le 1/2$ in the interval $[-1/10, 1/10]$, so
the integral of the Gaussian over the interval is at most 1/10. ∎
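
As an informal sanity check of the lemma (not a substitute for the proof), one can estimate the probability by simulation. The sketch below assumes, by spherical symmetry, that v is the first coordinate vector; the function name, dimension, trial count, and seed are arbitrary illustrative choices.

```python
import math
import random

def estimate_small_projection_probability(d, trials, seed=0):
    """Monte Carlo estimate of Prob(|x^T v| <= 1/(20 sqrt(d))) for x = y/|y|,
    where y is a unit variance spherical Gaussian in d dimensions."""
    rng = random.Random(seed)
    threshold = 1.0 / (20.0 * math.sqrt(d))
    hits = 0
    for _ in range(trials):
        y = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(t * t for t in y))
        # With v = e_1, the projection x^T v is just the first coordinate of y/|y|.
        if abs(y[0] / norm) <= threshold:
            hits += 1
    return hits / trials

estimate = estimate_small_projection_probability(d=50, trials=2000)
```

For moderate d the estimate typically lands well below the 1/10 term of the lemma's bound, which is consistent with the bound being loose.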





3.8 Singular Vectors and Eigenvectors 


For a square matrix B, if $Bx = \lambda x$, then x is an eigenvector of B and $\lambda$ is the corresponding
eigenvalue. We saw in Section 3.7, if $B = A^T A$, then the right singular vectors
$v_i$ of A are eigenvectors of B with eigenvalues $\sigma_i^2$. The same argument shows that the left
singular vectors $u_i$ of A are eigenvectors of $AA^T$ with eigenvalues $\sigma_i^2$.


The matrix $B = A^T A$ has the property that for any vector x, $x^T B x \ge 0$. This is
because $B = \sum_i \sigma_i^2 v_i v_i^T$ and for any x, $x^T v_i v_i^T x = (x^T v_i)^2 \ge 0$. A matrix B with
the property that $x^T B x \ge 0$ for all x is called positive semi-definite. Every matrix of
the form $A^T A$ is positive semi-definite. In the other direction, any positive semi-definite
matrix B can be decomposed into a product $A^T A$, and so its eigenvalue decomposition
can be obtained from the singular value decomposition of A. The interested reader should
consult a linear algebra book.
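
The positive semi-definiteness of $A^T A$ can also be seen directly, without the singular value decomposition. The short computation below is a standard identity, added here only as a worked illustration:

```latex
x^T (A^T A)\, x \;=\; (Ax)^T (Ax) \;=\; |Ax|^2 \;\ge\; 0
\qquad \text{for every vector } x .
```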


3.9 Applications of Singular Value Decomposition 
3.9.1 Centering Data 


Singular value decomposition is used in many applications and for some of these ap- 
plications it is essential to first center the data by subtracting the centroid of the data 
from each data point.$^{11}$ If you are interested in the statistics of the data and how it varies
in relationship to its mean, then you would center the data. On the other hand, if you
are interested in finding the best low rank approximation to a matrix, then you do not
center the data. The issue is whether you are finding the best fitting subspace or the best
fitting affine space. In the latter case you first center the data and then find the best
fitting subspace. See Figure 3.3.


11 The centroid of a set of points is the coordinate-wise average of the points.


Figure 3.3: If one wants statistical information relative to the mean of the data, one
needs to center the data. If one wants the best low rank approximation, one would not
center the data.


We first show that the line minimizing the sum of squared distances to a set of points, 
if not restricted to go through the origin, must pass through the centroid of the points. 
This implies that if the centroid is subtracted from each data point, such a line will pass 
through the origin. The best fit line can be generalized to k dimensional “planes”. The 
operation of subtracting the centroid from all data points is useful in other contexts as 
well. We give it the name “centering data”. 


Lemma 3.13 The best-fit line (minimizing the sum of perpendicular distances squared) 
of a set of data points must pass through the centroid of the points. 


Proof: Subtract the centroid from each data point so that the centroid is 0. After
centering the data let $\ell$ be the best-fit line and assume for contradiction that $\ell$ does
not pass through the origin. The line $\ell$ can be written as $\{a + \lambda v \mid \lambda \in R\}$, where a is
the closest point to 0 on $\ell$ and v is a unit length vector in the direction of $\ell$, which is
perpendicular to a. For a data point $a_i$, let $\mathrm{dist}(a_i, \ell)$ denote its perpendicular distance to
$\ell$. By the Pythagorean theorem, we have $|a_i - a|^2 = \mathrm{dist}(a_i, \ell)^2 + (v \cdot a_i)^2$, or equivalently,
$\mathrm{dist}(a_i, \ell)^2 = |a_i - a|^2 - (v \cdot a_i)^2$. Summing over all data points:

$$\sum_{i=1}^n \mathrm{dist}(a_i, \ell)^2 = \sum_{i=1}^n \left(|a_i - a|^2 - (v \cdot a_i)^2\right) = \sum_{i=1}^n \left(|a_i|^2 + |a|^2 - 2 a_i \cdot a - (v \cdot a_i)^2\right)$$

$$= \sum_{i=1}^n |a_i|^2 + n|a|^2 - 2a \cdot \left(\sum_{i=1}^n a_i\right) - \sum_{i=1}^n (v \cdot a_i)^2 = \sum_{i=1}^n |a_i|^2 + n|a|^2 - \sum_{i=1}^n (v \cdot a_i)^2,$$

where we used the fact that since the centroid is 0, $\sum_i a_i = 0$. The above expression is
minimized when a = 0, so the line $\ell' = \{\lambda v : \lambda \in R\}$ through the origin is a better fit
than $\ell$, contradicting $\ell$ being the best-fit line. ∎
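
The identity used in the proof can be checked numerically on a small example. The sketch below is illustrative only: the helper names and the three sample points are hypothetical. It centers the points and compares a line through the origin with a parallel line shifted by a vector a perpendicular to v; the shifted line is worse by exactly $n|a|^2$.

```python
def centroid(points):
    n = len(points)
    return [sum(p[c] for p in points) / n for c in range(len(points[0]))]

def center(points):
    """Subtract the centroid from every point."""
    mu = centroid(points)
    return [[x - m for x, m in zip(p, mu)] for p in points]

def sum_sq_dist_to_line(points, a, v):
    """Sum of squared perpendicular distances to the line {a + t v}, with |v| = 1."""
    total = 0.0
    for p in points:
        diff = [x - y for x, y in zip(p, a)]
        along = sum(x * y for x, y in zip(diff, v))
        total += sum(x * x for x in diff) - along * along
    return total

points = [[1.0, 2.0], [3.0, 5.0], [5.0, 5.0]]     # centroid is (3, 4)
centered = center(points)                          # (-2, -2), (0, 1), (2, 1)
v = [1.0, 0.0]
through_origin = sum_sq_dist_to_line(centered, [0.0, 0.0], v)
shifted = sum_sq_dist_to_line(centered, [0.0, 0.5], v)   # a = (0, 0.5) is perpendicular to v
```

Here the shifted line costs 6.75 against 6 for the line through the origin, a difference of $n|a|^2 = 3 \cdot 0.25$.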


A statement analogous to Lemma 3.13 holds for higher dimensional objects. Define 
an affine space as a subspace translated by a vector. So an affine space is a set of the 
form

$$\left\{ v_0 + \sum_{i=1}^k c_i v_i \;\middle|\; c_1, c_2, \ldots, c_k \in R \right\}.$$

Here, $v_0$ is the translation and $v_1, v_2, \ldots, v_k$ form an orthonormal basis for the subspace.


Lemma 3.14 The k dimensional affine space which minimizes the sum of squared per- 
pendicular distances to the data points must pass through the centroid of the points. 


Proof: We only give a brief idea of the proof, which is similar to the previous lemma.
Instead of $(v \cdot a_i)^2$, we will now have $\sum_{j=1}^k (v_j \cdot a_i)^2$, where the $v_j$, $j = 1, 2, \ldots, k$, are an
orthonormal basis of the subspace through the origin parallel to the affine space. ∎


3.9.2 Principal Component Analysis 


The traditional use of SVD is in Principal Component Analysis (PCA). PCA is illustrated
by a movie recommendation setting where there are n customers and d movies.
Let matrix A with elements $a_{ij}$ represent the amount that customer i likes movie j. One
hypothesizes that there are only k underlying basic factors that determine how much a
given customer will like a given movie, where k is much smaller than n or d. For example,
these could be the amount of comedy, drama, and action, the novelty of the story, etc.
Each movie can be described as a k-dimensional vector indicating how much of these basic
factors the movie has, and each customer can be described as a k-dimensional vector
indicating how important each of these basic factors is to that customer. The dot-product
of these two vectors is hypothesized to determine how much that customer will like that
movie. In particular, this means that the $n \times d$ matrix A can be expressed as the product
of an $n \times k$ matrix U describing the customers and a $k \times d$ matrix V describing the movies.
Finding the best rank k approximation $A_k$ by SVD gives such a U and V. One twist is
that A may not be exactly equal to UV, in which case $A - UV$ is treated as noise. Another
issue is that SVD gives a factorization with negative entries. Non-negative matrix
factorization (NMF) is more appropriate in some contexts where we want to keep entries
non-negative. NMF is discussed in Chapter 9.


In the above setting, A was available fully and we wished to find U and V to identify 
the basic factors. However, in a case such as movie recommendations, each customer may 
have seen only a small fraction of the movies, so it may be more natural to assume that we 
are given just a few elements of A and wish to estimate A. If A were an arbitrary matrix
of size $n \times d$, this would require $\Omega(nd)$ pieces of information and could not be done with a
few entries. But again hypothesize that A is a low rank matrix with added noise. If
now we also assume that the given entries are randomly drawn according to some known 
distribution, then there is a possibility that SVD can be used to estimate the whole of A. 
This area is called collaborative filtering and one of its uses is to recommend movies or to 
target an ad to a customer based on one or two purchases. We do not describe it here. 
Figure 3.4: Customer-movie data. The customers × movies matrix A is written as the
product A = UV, where U is customers × factors and V is factors × movies.


3.9.3 Clustering a Mixture of Spherical Gaussians 


Clustering is the task of partitioning a set of points into k subsets or clusters where 
each cluster consists of nearby points. Different definitions of the quality of a clustering 
lead to different solutions. Clustering is an important area which we will study in detail 
in Chapter 7. Here we will see how to solve a particular clustering problem using singular 
value decomposition. 


Mathematical formulations of clustering tend to have the property that finding the 
highest quality solution to a given set of data is NP-hard. One way around this is to 
assume stochastic models of input data and devise algorithms to cluster data generated by 
such models. Mixture models are a very important class of stochastic models. A mixture 
is a probability density or distribution that is the weighted sum of simple component 
probability densities. It is of the form 





$$f = w_1 p_1 + w_2 p_2 + \cdots + w_k p_k,$$


where $p_1, p_2, \ldots, p_k$ are the basic probability densities and $w_1, w_2, \ldots, w_k$ are positive real
numbers called mixture weights that add up to one. Clearly, f is a probability density
and integrates to one.


The model fitting problem is to fit a mixture of k basic densities to n independent, 
identically distributed samples, each sample drawn according to the same mixture dis- 
tribution f. The class of basic densities is known, but various parameters such as their 
means and the component weights of the mixture are not. Here, we deal with the case 
where the basic densities are all spherical Gaussians. There are two equivalent ways of 
thinking of the hidden sample generation process when only the samples are given: 


1. Pick each sample according to the density f on $R^d$.

2. Pick a random i from $\{1, 2, \ldots, k\}$ where the probability of picking i is $w_i$. Then, pick
a sample according to the density $p_i$.
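
The second view of the generation process translates directly into code. The following sketch assumes spherical Gaussian components that share one standard deviation; the function name, the weights, and the centers are hypothetical illustration values. It also returns the hidden component labels, which a clustering algorithm would have to recover.

```python
import random

def sample_mixture(weights, centers, sigma, n, seed=0):
    """Draw n samples from a mixture of spherical Gaussians.

    Step 1: pick component i with probability weights[i].
    Step 2: draw the sample from the Gaussian centered at centers[i].
    """
    rng = random.Random(seed)
    samples, labels = [], []
    for _ in range(n):
        i = rng.choices(range(len(weights)), weights=weights)[0]
        x = [rng.gauss(m, sigma) for m in centers[i]]
        samples.append(x)
        labels.append(i)   # hidden in practice; kept here only for inspection
    return samples, labels

samples, labels = sample_mixture(
    weights=[0.3, 0.7],
    centers=[[0.0, 0.0], [10.0, 10.0]],
    sigma=1.0,
    n=200,
)
```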


One approach to the model-fitting problem is to break it into two subproblems: 


1. First, cluster the set of samples into k clusters $C_1, C_2, \ldots, C_k$, where $C_i$ is the set of
samples generated according to $p_i$ (see (2) above) by the hidden generation process.


2. Then fit a single Gaussian distribution to each cluster of sample points. 


The second problem is relatively easier and indeed we saw the solution in Chapter 
2, where we showed that taking the empirical mean (the mean of the sample) and the 
empirical standard deviation gives us the best-fit Gaussian. The first problem is harder 
and this is what we discuss here. 


If the component Gaussians in the mixture have their centers very close together, then 
the clustering problem is unresolvable. In the limiting case where a pair of component 
densities are the same, there is no way to distinguish between them. What condition on 
the inter-center separation will guarantee unambiguous clustering? First, by looking at 
1-dimensional examples, it is clear that this separation should be measured in units of the 
standard deviation, since the density is a function of the number of standard deviations
from the mean. In one dimension, if two Gaussians have inter-center separation at least 
six times the maximum of their standard deviations, then they hardly overlap. This is 
summarized in the question: How many standard deviations apart are the means? In one 
dimension, if the answer is at least six, we can easily tell the Gaussians apart. What is 
the analog of this in higher dimensions? 


We discussed in Chapter 2 distances between two sample points from the same Gaussian
as well as the distance between two sample points from two different Gaussians. Recall
from that discussion that:


• If x and y are two independent samples from the same spherical Gaussian with
standard deviation$^{12}$ $\sigma$, then

$$|x - y|^2 \approx 2\left(\sqrt{d} \pm O(1)\right)^2 \sigma^2.$$


• If x and y are samples from different spherical Gaussians each of standard deviation
$\sigma$ and means separated by distance $\Delta$, then

$$|x - y|^2 \approx 2\left(\sqrt{d} \pm O(1)\right)^2 \sigma^2 + \Delta^2.$$


To ensure that points from the same Gaussian are closer to each other than points from
different Gaussians, we need

$$2\left(\sqrt{d} - O(1)\right)^2 \sigma^2 + \Delta^2 > 2\left(\sqrt{d} + O(1)\right)^2 \sigma^2.$$


12 Since a spherical Gaussian has the same standard deviation in every direction, we call it the standard
deviation of the Gaussian.

Expanding the squares, the high order term 2d cancels and we need that

$$\Delta > c\, d^{1/4} \sigma,$$


for some constant c. While this was not a completely rigorous argument, it can be used to
show that a distance based clustering approach (see Chapter 2 for an example) requires an
inter-mean separation of at least $c\, d^{1/4}$ standard deviations to succeed, thus unfortunately
not in keeping with the mnemonic of a constant number of standard deviations separation of
the means. Here, indeed, we will show that $\Omega(1)$ standard deviations suffice provided the
number k of Gaussians is O(1).
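
To make the cancellation of the high order term explicit, here is the expansion written out. As a simplifying assumption made only for this illustration, both $O(1)$ terms are replaced by the same positive constant $c_0$:

```latex
\begin{aligned}
2(\sqrt{d} - c_0)^2 \sigma^2 + \Delta^2 &> 2(\sqrt{d} + c_0)^2 \sigma^2 \\
\Delta^2 &> 2\sigma^2 \left[ (\sqrt{d} + c_0)^2 - (\sqrt{d} - c_0)^2 \right]
          = 8 c_0 \sqrt{d}\, \sigma^2 \\
\Delta &> 2\sqrt{2 c_0}\; d^{1/4}\, \sigma .
\end{aligned}
```

The $d$ and $c_0^2$ terms cancel inside the bracket, leaving only the cross term $4 c_0 \sqrt{d}$, which is where the $d^{1/4}$ dependence comes from.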


The central idea is the following. Suppose we can find the subspace spanned by the
k centers and project the sample points to this subspace. The projection of a spherical
Gaussian with standard deviation $\sigma$ remains a spherical Gaussian with standard deviation
$\sigma$ (Lemma 3.15). In the projection, the inter-center separation remains the same. So in
the projection, the Gaussians are distinct provided the inter-center separation in the whole
space is at least $c\, k^{1/4} \sigma$, which is less than $c\, d^{1/4} \sigma$ for $k < d$. Interestingly, we will see that
the subspace spanned by the k centers is essentially the best-fit k-dimensional subspace
that can be found by singular value decomposition.


Lemma 3.15 Suppose p is a d-dimensional spherical Gaussian with center $\mu$ and standard
deviation $\sigma$. The density of p projected onto a k-dimensional subspace V is a spherical
Gaussian with the same standard deviation.


Proof: Rotate the coordinate system so V is spanned by the first k coordinate vectors.
The Gaussian remains spherical with standard deviation $\sigma$ although the coordinates of
its center have changed. For a point $x = (x_1, x_2, \ldots, x_d)$, we will use the notation
$x' = (x_1, x_2, \ldots, x_k)$ and $x'' = (x_{k+1}, x_{k+2}, \ldots, x_d)$. The density of the projected Gaussian at
the point $(x_1, x_2, \ldots, x_k)$ is

$$c\, e^{-\frac{|x' - \mu'|^2}{2\sigma^2}} \int_{x''} e^{-\frac{|x'' - \mu''|^2}{2\sigma^2}}\, dx'' = c'\, e^{-\frac{|x' - \mu'|^2}{2\sigma^2}}.$$

This implies the lemma. ∎


We now show that the top k singular vectors produced by the SVD span the space of 
the k centers. First, we extend the notion of best fit to probability distributions. Then 
we show that for a single spherical Gaussian whose center is not the origin, the best fit 
1-dimensional subspace is the line through the center of the Gaussian and the origin. Next,
we show that the best fit k-dimensional subspace for a single Gaussian whose center is not 
the origin is any k-dimensional subspace containing the line through the Gaussian's center 
and the origin. Finally, for k spherical Gaussians, the best fit k-dimensional subspace is 
the subspace containing their centers. Thus, the SVD finds the subspace that contains 
the centers. 
1. The best fit 1-dimensional subspace
to a spherical Gaussian is the line 
through its center and the origin. 


2. Any k-dimensional subspace contain- 
ing the line is a best fit k-dimensional 
subspace for the Gaussian. 


3. The best fit k-dimensional subspace 
for k spherical Gaussians is the sub- 
space containing their centers. 


Figure 3.5: Best fit subspace to a spherical Gaussian. 


Recall that for a set of points, the best-fit line is the line passing through the origin 
that maximizes the sum of squared lengths of the projections of the points onto the line. 
We extend this definition to probability densities instead of a set of points. 


Definition 3.1 If p is a probability density in d space, the best fit line for p is the line in
the $v_1$ direction where

$$v_1 = \arg\max_{|v| = 1} \; E_{x \sim p}\left[(v^T x)^2\right].$$


For a spherical Gaussian centered at the origin, it is easy to see that any line passing
through the origin is a best fit line. Our next lemma shows that the best fit line for a
spherical Gaussian centered at $\mu \ne 0$ is the line passing through $\mu$ and the origin.


Lemma 3.16 Let the probability density p be a spherical Gaussian with center $\mu \ne 0$.
The unique best fit 1-dimensional subspace is the line passing through $\mu$ and the origin.
If $\mu = 0$, then any line through the origin is a best-fit line.

Proof: For a randomly chosen x (according to p) and a fixed unit length vector v,

$$\begin{aligned}
E\left[(v^T x)^2\right] &= E\left[\left(v^T (x - \mu) + v^T \mu\right)^2\right] \\
&= E\left[\left(v^T (x - \mu)\right)^2 + 2 (v^T \mu)\left(v^T (x - \mu)\right) + (v^T \mu)^2\right] \\
&= E\left[\left(v^T (x - \mu)\right)^2\right] + 2 (v^T \mu)\, E\left[v^T (x - \mu)\right] + (v^T \mu)^2 \\
&= E\left[\left(v^T (x - \mu)\right)^2\right] + (v^T \mu)^2 \\
&= \sigma^2 + (v^T \mu)^2,
\end{aligned}$$

where the fourth line follows from the fact that $E[v^T (x - \mu)] = 0$, and the fifth line
follows from the fact that $E[(v^T (x - \mu))^2]$ is the variance in the direction v. The best fit
line v maximizes $E_{x \sim p}[(v^T x)^2]$ and therefore maximizes $(v^T \mu)^2$. This is maximized when
v is aligned with the center $\mu$. To see uniqueness, just note that if $\mu \ne 0$, then $(v^T \mu)^2$ is
strictly less when v is not aligned with the center. ∎


We now extend Definition 3.1 to k-dimensional subspaces. 


Definition 3.2 If p is a probability density in d-space then the best-fit k-dimensional
subspace $V_k$ is

$$V_k = \arg\max_{V :\, \dim(V) = k} \; E_{x \sim p}\left(|\mathrm{proj}(x, V)|^2\right),$$

where proj(x, V) is the orthogonal projection of x onto V.


Lemma 3.17 For a spherical Gaussian with center $\mu$, a k-dimensional subspace is a best
fit subspace if and only if it contains $\mu$.


Proof: If $\mu = 0$, then by symmetry any k-dimensional subspace is a best-fit subspace. If
$\mu \ne 0$, then the best-fit line must pass through $\mu$ by Lemma 3.16. Now, as in the greedy
algorithm for finding subsequent singular vectors, we would project perpendicular to the
first singular vector. But after the projection, the mean of the Gaussian becomes 0 and
any vectors will do as subsequent best-fit directions. ∎


This leads to the following theorem. 


Theorem 3.18 If p is a mixture of k spherical Gaussians, then the best fit k-dimensional
subspace contains the centers. In particular, if the means of the Gaussians are linearly
independent, the space spanned by them is the unique best-fit k dimensional subspace.


Proof: Let p be the mixture $w_1 p_1 + w_2 p_2 + \cdots + w_k p_k$. Let V be any subspace of dimension
k or less. Then,

$$E_{x \sim p}\left(|\mathrm{proj}(x, V)|^2\right) = \sum_{i=1}^k w_i\, E_{x \sim p_i}\left(|\mathrm{proj}(x, V)|^2\right).$$

If V contains the centers of the densities $p_i$, by Lemma 3.17, each term in the summation
is individually maximized, which implies the entire summation is maximized, proving the
theorem. ∎


For an infinite set of points drawn according to the mixture, the k-dimensional SVD 
subspace gives exactly the space of the centers. In reality, we have only a large number 
of samples drawn according to the mixture. However, it is intuitively clear that as the 
number of samples increases, the set of sample points will approximate the probability 
density and so the SVD subspace of the sample will be close to the space spanned by 
the centers. The details of how close it gets as a function of the number of samples are 
technical and we do not carry this out here. 


3.9.4 Ranking Documents and Web Pages 


An important task for a document collection is to rank the documents according to 
their intrinsic relevance to the collection. A good candidate definition of “intrinsic rele- 
vance” is a document's projection onto the best-fit direction for that collection, namely the 
top left-singular vector of the term-document matrix. An intuitive reason for this is that 
this direction has the maximum sum of squared projections of the collection and so can be 
thought of as a synthetic term-document vector best representing the document collection. 


Ranking in order of the projection of each document’s term vector along the best fit 
direction has a nice interpretation in terms of the power method. For this, we consider 
a different example, that of the web with hypertext links. The World Wide Web can 
be represented by a directed graph whose nodes correspond to web pages and directed 
edges to hypertext links between pages. Some web pages, called authorities, are the most 
prominent sources for information on a given topic. Other pages, called hubs, are ones
that identify the authorities on a topic. Authority pages are pointed to by many hub 
pages and hub pages point to many authorities. One is led to what seems like a circular 
definition: a hub is a page that points to many authorities and an authority is a page 
that is pointed to by many hubs. 


One would like to assign hub weights and authority weights to each node of the web. 
If there are n nodes, the hub weights form an n-dimensional vector u and the authority 
weights form an n-dimensional vector v. Suppose A is the adjacency matrix representing 
the directed graph. Here $a_{ij}$ is 1 if there is a hypertext link from page i to page j and 0
otherwise. Given hub vector u, the authority vector v could be computed by the formula

$$v_j \propto \sum_{i=1}^n u_i a_{ij}$$

since the right hand side is the sum of the hub weights of all the nodes that point to node
j. In matrix terms,

$$v = A^T u / |A^T u|.$$
Similarly, given an authority vector v, the hub vector u could be computed by
$u = Av/|Av|$. Of course, at the start, we have neither vector. But the above discussion
suggests a power iteration. Start with any v. Set $u = Av$, then set $v = A^T u$, then
renormalize and repeat the process. We know from the power method that this converges
to the left and right-singular vectors. So after sufficiently many iterations, we may use the
left vector u as the hub weights vector and project each column of A onto this direction
and rank columns (authorities) in order of this projection. But the projections just form
the vector $A^T u$ which equals a multiple of v. So we can just rank by order of the $v_j$.
This is the basis of an algorithm called the HITS algorithm, which was one of the early
proposals for ranking web pages.
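
The iteration just described can be sketched as follows. This is an illustrative toy version, not a production implementation of HITS: it uses a dense adjacency matrix, starts from the all-ones vector as one possible choice of "any v", normalizes both vectors in every round, and assumes the graph has at least one edge. The four-page graph is hypothetical.

```python
import math

def normalize(x):
    norm = math.sqrt(sum(t * t for t in x))
    return [t / norm for t in x]

def hits(adj, iterations=50):
    """Alternate u = A v and v = A^T u, renormalizing each time.

    adj[i][j] is 1 if page i links to page j and 0 otherwise.
    Returns (hub weights u, authority weights v).
    """
    n = len(adj)
    v = normalize([1.0] * n)
    u = v
    for _ in range(iterations):
        u = normalize([sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)])  # u = A v
        v = normalize([sum(adj[i][j] * u[i] for i in range(n)) for j in range(n)])  # v = A^T u
    return u, v

# Hypothetical web: page 0 links to pages 2 and 3, page 1 links to page 2.
adj = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
hub, authority = hits(adj)
ranking = sorted(range(len(adj)), key=lambda j: -authority[j])
```

In this example page 2 is pointed to by both hubs and comes out as the top authority, while page 0, which points to both authorities, is the stronger hub.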


A different ranking called pagerank is widely used. It is based on a random walk on 
the graph described above. We will study random walks in detail in Chapter 4. 


3.9.5 An Illustrative Application of SVD 


A deep neural network in which input images are classified by category such as cat,
dog, or car maps an image to an activation space. The dimension of the activation space 
might be 4,000, but the set of cat images might be mapped to a much lower dimensional 
manifold. To determine the dimension of the cat manifold, we could construct a tangent 
subspace at an activation vector for a cat image. However, we only have 1,000 cat images 
and the images are spread far apart in the activation space. We need a large number of 
cat activation vectors close to each original cat activation vector to determine the dimen- 
sion of the tangent subspace. To do this we want to slightly modify each cat image to 
get many images that are close to the original. One way to do this is to do a singular 
value decomposition of an image and zero out a few very small singular values. If the 
image is 1,000 by 1,000 there will be 1,000 singular values. The smallest 100 will be
essentially zero and zeroing out a subset of them should not change the image much and
produce images whose activation vectors are very close. Since there are $\binom{100}{10}$ subsets of
ten singular values, we can generate say 10,000 such images by zeroing out ten singular 
values. Given the corresponding activation vectors, we can form a matrix of activation 
vectors and determine the rank of the matrix which should give the dimension of the 
tangent subspace to the original cat activation vector. 


To determine the rank of the matrix of 10,000 activation vectors, we again do a singular 
value decomposition. To determine the actual rank, we need to determine a cutoff point 
below which we conclude the remaining singular values are noise. We might consider a 
sufficient number of the largest singular values so that their sum of squares is 95% of the 
square of the Frobenius norm of the matrix or look to see where there is a sharp drop in 
the singular values. 
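
The first cutoff rule, keeping enough of the largest singular values to account for 95% of the squared Frobenius norm, can be sketched as below. The function name and the sample singular values are hypothetical; the input is assumed to be the list of singular values already computed by some SVD routine.

```python
def effective_rank(singular_values, energy=0.95):
    """Smallest r such that the r largest singular values carry at least
    `energy` of the squared Frobenius norm (the sum of all squared values)."""
    s = sorted(singular_values, reverse=True)
    total = sum(x * x for x in s)
    running = 0.0
    for r, x in enumerate(s, start=1):
        running += x * x
        if running >= energy * total:
            return r
    return len(s)

# Hypothetical spectrum: two large values followed by small, noise-like ones.
rank = effective_rank([10.0, 5.0, 1.0, 0.1, 0.05])
```

Here the two largest values carry 125 of the total 126.0125, which exceeds 95%, so the estimated rank is 2.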
3.9.6 An Application of SVD to a Discrete Optimization Problem


In clustering a mixture of Gaussians, SVD was used as a dimension reduction tech- 
nique. It found a k-dimensional subspace (the space of centers) of a d-dimensional space 
and made the Gaussian clustering problem easier by projecting the data to the subspace. 
Here, instead of fitting a model to data, we consider an optimization problem where ap- 
plying dimension reduction makes the problem easier. The use of SVD to solve discrete 
optimization problems is a relatively new subject with many applications. We start with 
an important NP-hard problem, the maximum cut problem for a directed graph G(V, E). 


The maximum cut problem is to partition the nodes of an n-node directed graph into
two subsets S and $\bar{S}$ so that the number of edges from S to $\bar{S}$ is maximized. Let A be
the adjacency matrix of the graph. With each vertex i, associate an indicator variable $x_i$.
The variable $x_i$ will be set to 1 for $i \in S$ and 0 for $i \in \bar{S}$. The vector $x = (x_1, x_2, \ldots, x_n)$
is unknown and we are trying to find it, or equivalently the cut, so as to maximize the
number of edges across the cut. The number of edges across the cut is precisely

$$\sum_{i,j} x_i (1 - x_j) a_{ij}.$$

Thus, the maximum cut problem can be posed as the optimization problem

Maximize $\sum_{i,j} x_i (1 - x_j) a_{ij}$ subject to $x_i \in \{0, 1\}$.

In matrix notation,

$$\sum_{i,j} x_i (1 - x_j) a_{ij} = x^T A (\mathbf{1} - x),$$

where $\mathbf{1}$ denotes the vector of all 1's. So, the problem can be restated as

Maximize $x^T A (\mathbf{1} - x)$ subject to $x_i \in \{0, 1\}$. (3.1)
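
For very small graphs the objective in (3.1) can be evaluated and maximized by brute force, which is useful for checking a faster method against. The sketch below is exponential in n and is meant only as an illustration; the function names and the three-node graph are hypothetical.

```python
from itertools import product

def cut_value(adj, x):
    """Number of edges from S = {i : x[i] = 1} to its complement, i.e. x^T A (1 - x)."""
    n = len(adj)
    return sum(x[i] * (1 - x[j]) * adj[i][j] for i in range(n) for j in range(n))

def brute_force_max_cut(adj):
    """Try all 2^n indicator vectors; only feasible for tiny n."""
    n = len(adj)
    return max(product((0, 1), repeat=n), key=lambda x: cut_value(adj, x))

# Hypothetical directed graph with edges 0->1, 0->2, 1->2.
adj = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
best_x = brute_force_max_cut(adj)
best_value = cut_value(adj, best_x)
```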


This problem is NP-hard. However we will see that for dense graphs, that is, graphs
with $\Omega(n^2)$ edges and therefore whose optimal solution has size $\Omega(n^2)$,$^{13}$ we can use the
SVD to find a near optimal solution in polynomial time. To do so we will begin by
computing the SVD of A and replacing A by $A_k = \sum_{i=1}^k \sigma_i u_i v_i^T$ in (3.1) to get


Maximize $x^T A_k (\mathbf{1} - x)$ subject to $x_i \in \{0, 1\}$. (3.2)


Note that the matrix $A_k$ is no longer a 0-1 adjacency matrix.


We will show that:


1. For each 0-1 vector x, $x^T A_k (\mathbf{1} - x)$ and $x^T A (\mathbf{1} - x)$ differ by at most $\frac{n^2}{\sqrt{k+1}}$. Thus,
the maxima in (3.1) and (3.2) differ by at most this amount.


2. A near optimal x for (3.2) can be found in time $n^{O(k)}$ by exploiting the low rank
of $A_k$, which is polynomial time for constant k. By Item 1 this is near optimal for
(3.1) where near optimal means with additive error of at most $\frac{n^2}{\sqrt{k+1}}$.


13 Any graph of m edges has a cut of size at least m/2. This can be seen by noting that the expected
size of the cut for a random $x \in \{0, 1\}^n$ is exactly m/2.

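First, we prove Item 1. Since x and $\mathbf{1} - x$ are 0-1 n-vectors, each has length at most
$\sqrt{n}$. By the definition of the 2-norm, $|(A - A_k)(\mathbf{1} - x)| \le \sqrt{n}\, \|A - A_k\|_2$. Now since
$x^T (A - A_k)(\mathbf{1} - x)$ is the dot product of the vector x with the vector $(A - A_k)(\mathbf{1} - x)$,

$$|x^T (A - A_k)(\mathbf{1} - x)| \le n\, \|A - A_k\|_2.$$

By Lemma 3.8, $\|A - A_k\|_2 = \sigma_{k+1}(A)$. The inequalities

$$(k + 1)\sigma_{k+1}^2 \le \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_{k+1}^2 \le \|A\|_F^2 = \sum_{i,j} a_{ij}^2 \le n^2$$

imply that $\sigma_{k+1}^2 \le \frac{n^2}{k+1}$ and hence $\|A - A_k\|_2 \le \frac{n}{\sqrt{k+1}}$, proving Item 1.
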

Next we focus on Item 2. It is instructive to look at the special case when k=1 and A 
is approximated by the rank one matrix A;. An even more special case when the left and 
right-singular vectors u and v are identical is already NP-hard to solve exactly because 
it subsumes the problem of whether for a set of n integers, $\{a_1, a_2, \ldots, a_n\}$, there is a
partition into two subsets whose sums are equal. However, for that problem, there is 
an efficient dynamic programming algorithm that finds a near-optimal solution. We will 
build on that idea for the general rank k problem. 


For Item 2, we want to maximize $\sum_{i=1}^k \sigma_i (x^T u_i)(v_i^T (\mathbf{1} - x))$ over 0-1 vectors x. A
piece of notation will be useful. For any $S \subseteq \{1, 2, \ldots, n\}$, write $u_i(S)$ for the sum of coordinates
of the vector $u_i$ corresponding to elements in the set S, that is, $u_i(S) = \sum_{j \in S} u_{ij}$,
and similarly for $v_i$. We will find S to maximize $\sum_{i=1}^k \sigma_i u_i(S) v_i(\bar{S})$ using dynamic
programming.


For a subset S of $\{1, 2, \ldots, n\}$, define the 2k-dimensional vector

$$w(S) = \left(u_1(S), v_1(\bar{S}), u_2(S), v_2(\bar{S}), \ldots, u_k(S), v_k(\bar{S})\right).$$


If we had the list of all such vectors, we could find $\sum_{i=1}^{k} \sigma_i u_i(S) v_i(\bar{S})$ for each of them
and take the maximum. There are $2^n$ subsets $S$, but several $S$ could have the same $w(S)$
and in that case it suffices to list just one of them. Round each coordinate of each $u_i$ to
the nearest integer multiple of $\frac{1}{nk^2}$. Call the rounded vector $\tilde{u}_i$. Similarly obtain $\tilde{v}_i$. Let
$\tilde{w}(S)$ denote the vector $\left(\tilde{u}_1(S), \tilde{v}_1(\bar{S}), \tilde{u}_2(S), \tilde{v}_2(\bar{S}), \ldots, \tilde{u}_k(S), \tilde{v}_k(\bar{S})\right)$. We will construct
a list of all possible values of the vector $\tilde{w}(S)$. Again, if several different $S$'s lead to the
same vector $\tilde{w}(S)$, we will keep only one copy on the list. The list will be constructed by
dynamic programming. For the recursive step, assume we already have a list of all such
vectors for $S \subseteq \{1,2,\ldots,i\}$ and wish to construct the list for $S' \subseteq \{1,2,\ldots,i+1\}$. Each
$S \subseteq \{1,2,\ldots,i\}$ leads to two possible $S' \subseteq \{1,2,\ldots,i+1\}$, namely, $S$ and $S \cup \{i+1\}$.
In the first case, the vector $\tilde{w}(S') = \left(\tilde{u}_1(S), \tilde{v}_1(\bar{S}) + \tilde{v}_{1,i+1}, \tilde{u}_2(S), \tilde{v}_2(\bar{S}) + \tilde{v}_{2,i+1}, \ldots\right)$.
In the second case, it is $\tilde{w}(S') = \left(\tilde{u}_1(S) + \tilde{u}_{1,i+1}, \tilde{v}_1(\bar{S}), \tilde{u}_2(S) + \tilde{u}_{2,i+1}, \tilde{v}_2(\bar{S}), \ldots\right)$. We
put in these two vectors for each vector in the previous list. Then, crucially, we prune,
i.e., eliminate duplicates.
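The list construction just described can be sketched in Python. This is only an illustration under stated assumptions: the function name `max_rounded_objective`, the representation of the vectors $u_i$ and $v_i$ as lists of lists, and the choice to store rounded coordinates as integer numerators of the grid $\frac{1}{nk^2}$ (so that duplicates compare exactly) are all choices made here, not taken from the text.

```python
def max_rounded_objective(sigma, U, V):
    """Return (best rounded objective value, one subset S achieving it).

    sigma: list of k singular values; U, V: k lists of n coordinates each.
    All names here are hypothetical, chosen for this sketch.
    """
    k, n = len(sigma), len(U[0])
    grid = 1.0 / (n * k * k)                      # rounding unit 1/(n k^2)
    # Rounded coordinates, stored as integer multiples of the grid.
    Ur = [[round(x / grid) for x in row] for row in U]
    Vr = [[round(x / grid) for x in row] for row in V]

    # Map each distinct rounded vector to one representative subset S.
    table = {tuple([0] * (2 * k)): ()}
    for j in range(n):
        nxt = {}
        for w, S in table.items():
            out = list(w)                         # j not in S: the v-slots grow
            inn = list(w)                         # j in S: the u-slots grow
            for i in range(k):
                out[2 * i + 1] += Vr[i][j]
                inn[2 * i] += Ur[i][j]
            nxt.setdefault(tuple(out), S)         # prune: keep one copy only
            nxt.setdefault(tuple(inn), S + (j,))
        table = nxt

    def value(w):
        return sum(sigma[i] * (w[2 * i] * grid) * (w[2 * i + 1] * grid)
                   for i in range(k))

    best = max(table, key=value)
    return value(best), table[best]


# Tiny usage example with k = 1 and n = 2 (indices are 0-based here).
val, S = max_rounded_objective([1.0], [[1.0, 0.0]], [[0.0, 1.0]])
```

Using a dictionary keyed by the rounded vector makes the pruning step automatic: a vector that has already been seen is simply not inserted again.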


Assume that $k$ is constant. Now, we show that the error is at most $\frac{n^2}{k}$ as claimed.
Since $u_i$ and $v_i$ are unit length vectors, $|u_i(S)|, |v_i(\bar{S})| \le \sqrt{n}$. Also $|\tilde{u}_i(S) - u_i(S)| \le \frac{n}{nk^2} = \frac{1}{k^2}$
and similarly for $v_i$. To bound the error, we use an elementary fact: if $a$ and $b$ are
reals with $|a|, |b| \le M$ and we estimate $a$ by $a'$ and $b$ by $b'$ so that $|a - a'|, |b - b'| \le \delta \le M$,
then $a'b'$ is an estimate of $ab$ in the sense

$$|ab - a'b'| = |a(b - b') + b'(a - a')| \le |a||b - b'| + (|b| + |b - b'|)|a - a'| \le 3M\delta.$$

Using this,

$$\left| \sum_{i=1}^{k} \sigma_i \tilde{u}_i(S) \tilde{v}_i(\bar{S}) - \sum_{i=1}^{k} \sigma_i u_i(S) v_i(\bar{S}) \right| \le 3k\sigma_1 \sqrt{n}/k^2 \le 3n^{3/2}/k \le n^2/k,$$

and this meets the claimed error bound.
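The elementary fact about products of estimates can be checked numerically. The sketch below samples random values satisfying the hypotheses and records the worst observed error; the values of `M` and `delta`, the seed, and the helper name `product_error` are arbitrary choices for illustration.

```python
import random


def product_error(a, b, a_est, b_est):
    """Return |ab - a'b'| for estimates a', b' of a, b."""
    return abs(a * b - a_est * b_est)


rng = random.Random(0)           # fixed seed so the check is repeatable
M, delta = 5.0, 0.25             # hypothetical values with delta <= M
worst = 0.0
for _ in range(10000):
    a, b = rng.uniform(-M, M), rng.uniform(-M, M)
    a_est = a + rng.uniform(-delta, delta)
    b_est = b + rng.uniform(-delta, delta)
    worst = max(worst, product_error(a, b, a_est, b_est))

# The fact predicts that the error never exceeds 3 * M * delta.
bound_holds = worst <= 3 * M * delta
```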


Next, we show that the running time is polynomially bounded. First, $|\tilde{u}_i(S)|, |\tilde{v}_i(S)| \le 2\sqrt{n}$.
Since $\tilde{u}_i(S)$ and $\tilde{v}_i(S)$ are all integer multiples of $1/(nk^2)$, there are at most $2n^{3/2}k^2$
possible values of $\tilde{u}_i(S)$ and $\tilde{v}_i(S)$ from which it follows that the list of $\tilde{w}(S)$ never gets
larger than $(2n^{3/2}k^2)^{2k}$ which for fixed $k$ is polynomially bounded.


We summarize what we have accomplished. 


Theorem 3.19 Given a directed graph $G(V, E)$, a cut of size at least the maximum cut
minus $O\left(\frac{n^2}{\sqrt{k}}\right)$ can be computed in time polynomial in $n$ for any fixed $k$.


Note that achieving the same accuracy in time polynomial in $n$ and $k$ would give an
exact max cut in polynomial time.


3.10 Bibliographic Notes 


Singular value decomposition is fundamental to numerical analysis and linear algebra. 
There are many texts on these subjects and the interested reader may want to study 
these. A good reference is [GvL96]. The material on clustering a mixture of Gaussians 
in Section 3.9.3 is from [VW02]. Modeling data with a mixture of Gaussians is a standard
tool in statistics. Several well-known heuristics like the expectation-maximization
algorithm are used to learn (fit) the mixture model to data. Recently, in theoretical computer
science, there has been modest progress on provable polynomial-time algorithms
for learning mixtures. Some references are [DS07], [AK05], [AM05], and [MV10]. The
application to the discrete optimization problem is from [FK99]. The section on ranking
documents/webpages is from two influential papers, one on hubs and authorities by
Jon Kleinberg [Kle99] and the other on pagerank by Page, Brin, Motwani and Winograd
[BMPW98]. Exercise 3.18 is from [EVL10].


3.11 Exercises


Exercise 3.1 (Least squares vertical error) In many experiments one collects the
value of a parameter at various instances of time. Let $y_i$ be the value of the parameter $y$
at time $x_i$. Suppose we wish to construct the best linear approximation to the data in the
sense that we wish to minimize the mean square error. Here error is measured vertically
rather than perpendicular to the line. Develop formulas for $m$ and $b$ to minimize the mean
square error of the points $\{(x_i, y_i) \mid 1 \le i \le n\}$ to the line $y = mx + b$.


Exercise 3.2 Given five observed variables, height, weight, age, income, and blood pres- 
sure of n people, how would one find the best least squares fit affine subspace of the form 


$a_1$ (height) + $a_2$ (weight) + $a_3$ (age) + $a_4$ (income) + $a_5$ (blood pressure) = $a_6$


Here $a_1, a_2, \ldots, a_6$ are the unknown parameters. If there is a good best fit 4-dimensional
affine subspace, then one can think of the points as lying close to a 4-dimensional sheet 
rather than points lying in 5-dimensions. Why might it be better to use the perpendicular 
distance to the affine subspace rather than vertical distance where vertical distance is 
measured along the coordinate axis corresponding to one of the variables? 


Exercise 3.3 Manually find the best fit lines (not subspaces which must contain the ori- 
gin) through the points in the sets below. Best fit means minimize the perpendicular 
distance. Subtract the center of gravity of the points in the set from each of the points 
in the set and find the best fit line for the resulting points. Does the best fit line for the 
original data go through the origin? 


1. (4,4) (6,2) 
2. (4,2) (4,4) (6,2) (6,4) 
3. (3,2.5) (3,5) (5,1) (5,3.5) 


Exercise 3.4 Manually determine the best fit line through the origin for each of the 
following sets of points. Is the best fit line unique? Justify your answer for each of the 
subproblems. 


1. {(0,1),(1,0)}
2. {(0,1),(2,0)} 


Exercise 3.5 Manually find the left and right-singular vectors, the singular values, and 
the SVD decomposition of the matrices in Figure 3.6. 


Exercise 3.6 Let $A$ be a square $n \times n$ matrix whose rows are orthonormal. Prove that
the columns of $A$ are orthonormal.


Figure 3.6: SVD problem. The rows of each matrix are plotted as points in the plane.

Figure 3.6 a: $M = \begin{pmatrix} 1 & 1 \\ 0 & 3 \\ 3 & 0 \end{pmatrix}$, with points (1,1), (0,3), (3,0).

Figure 3.6 b: $M = \begin{pmatrix} 0 & 2 \\ 2 & 0 \\ 1 & 3 \\ 3 & 1 \end{pmatrix}$, with points (0,2), (2,0), (1,3), (3,1).


Exercise 3.7 Suppose $A$ is an $n \times n$ matrix with block diagonal structure with $k$ equal size
blocks where all entries of the $i^{th}$ block are $a_i$ with $a_1 > a_2 > \cdots > a_k > 0$. Show that $A$
has exactly $k$ non-zero singular vectors $v_1, v_2, \ldots, v_k$ where $v_i$ has the value $\left(\frac{k}{n}\right)^{1/2}$ in the
coordinates corresponding to the $i^{th}$ block and 0 elsewhere. In other words, the singular
vectors exactly identify the blocks of the diagonal. What happens if $a_1 = a_2 = \cdots = a_k$?
In the case where the $a_i$ are equal, what is the structure of the set of all possible singular
vectors?

Hint: By symmetry, the top singular vector's components must be constant in each block.


Exercise 3.8 Interpret the first right and left-singular vectors for the document term
matrix.


Exercise 3.9 Verify that the sum of $r$ rank one matrices $\sum_{i=1}^{r} c_i x_i y_i^T$ can be written as
$XCY^T$, where the $x_i$ are the columns of $X$, the $y_i$ are the columns of $Y$, and $C$ is a
diagonal matrix with the constants $c_i$ on the diagonal.


Exercise 3.10 Let $\sum_{i=1}^{r} \sigma_i u_i v_i^T$ be the SVD of $A$. Show that $|u_1^T A| = \sigma_1$ and that
$|u_1^T A| = \max_{|u|=1} |u^T A|$.


Exercise 3.11 If $\sigma_1, \sigma_2, \ldots, \sigma_r$ are the singular values of $A$ and $v_1, v_2, \ldots, v_r$ are the
corresponding right-singular vectors, show that

1. $A^T A = \sum_{i=1}^{r} \sigma_i^2 v_i v_i^T$

2. $v_1, v_2, \ldots, v_r$ are eigenvectors of $A^T A$.

3. Assuming that the eigenvectors of $A^T A$ are unique up to multiplicative constants,
conclude that the singular vectors of $A$ (which by definition must be unit length) are
unique up to sign.


Exercise 3.12 Let $\sum_{i} \sigma_i u_i v_i^T$ be the singular value decomposition of a rank $r$ matrix $A$.
Let $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T$ be a rank $k$ approximation to $A$ for some $k < r$. Express the following
quantities in terms of the singular values $\{\sigma_i, 1 \le i \le r\}$.

1. $\|A_k\|_F^2$
2. $\|A_k\|_2^2$
3. $\|A - A_k\|_F^2$
4. $\|A - A_k\|_2^2$


Exercise 3.13 If $A$ is a symmetric matrix with distinct singular values, show that the
left and right singular vectors are the same and that $A = VDV^T$.


Exercise 3.14 Let $A$ be a matrix. How would you compute

$$v_1 = \arg\max_{|v|=1} |Av|?$$

How would you use or modify your algorithm for finding $v_1$ to compute the first few
singular vectors of $A$?


Exercise 3.15 Use the power method to compute the singular value decomposition of the
matrix

$$A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$$


Exercise 3.16 Consider the matrix

$$A = \begin{pmatrix} 1 & 2 \\ -1 & 2 \\ 1 & -2 \\ -1 & -2 \end{pmatrix}$$

1. Run the power method starting from $x = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$ for $k = 3$ steps. What does this give
as an estimate of $v_1$?

2. What actually are the $v_i$'s, $\sigma_i$'s, and $u_i$'s? It may be easiest to do this by computing
the eigenvectors of $B = A^T A$.

3. Suppose matrix $A$ is a database of restaurant ratings: each row is a person, each
column is a restaurant, and $a_{ij}$ represents how much person $i$ likes restaurant $j$.
What might $v_1$ represent? What about $u_1$? How about the gap $\sigma_1 - \sigma_2$?


Exercise 3.17

1. Write a program to implement the power method for computing the
first singular vector of a matrix. Apply your program to the matrix

$$A = \begin{pmatrix} 1 & 2 & 3 & \cdots & 9 & 10 \\ 2 & 3 & 4 & \cdots & 10 & 0 \\ \vdots & \vdots & \vdots & & \vdots & \vdots \\ 9 & 10 & 0 & \cdots & 0 & 0 \\ 10 & 0 & 0 & \cdots & 0 & 0 \end{pmatrix}$$

2. Modify the power method to find the first four singular vectors of a matrix $A$ as
follows. Randomly select four vectors and find an orthonormal basis for the space
spanned by the four vectors. Then multiply each of the basis vectors times $A$ and
find a new orthonormal basis for the space spanned by the resulting four vectors.
Apply your method to find the first four singular vectors of matrix $A$ from part 1.
In Matlab the command orth finds an orthonormal basis for the space spanned by a
set of vectors.


Exercise 3.18

1. For $n = 5, 10, \ldots, 25$ create random graphs by generating random vectors $x = (x_1, x_2, \ldots, x_n)$
and $y = (y_1, y_2, \ldots, y_n)$. Create edges $(x_i, y_i) - (x_{i+1}, y_{i+1})$ for $i = 1 : n$ and an edge
$(x_n, y_n) - (x_1, y_1)$.

2. For each graph create a new graph by selecting the midpoint of each edge for the
coordinates of the vertices and add edges between vertices corresponding to the midpoints
of two adjacent edges of the original graph. What happens when you iterate
this process? It is best to draw the graphs.

3. Repeat the above step but normalize the vectors $x$ and $y$ to have unit length after
each iteration. What happens?

4. One could implement the process by matrix multiplication where $x(t)$ and $y(t)$ are
the vectors at the $t^{th}$ iteration. What is the matrix $A$ such that $x(t + 1) = Ax(t)$?

5. What is the first singular vector of $A$ and the first two singular values of $A$? Does
this explain what happens and how long the process takes to converge?

6. If $A$ is invertible what happens when you run the process backwards?


Exercise 3.19 A matrix $A$ is positive semi-definite if for all $x$, $x^T A x \ge 0$.

1. Let $A$ be a real valued matrix. Prove that $B = AA^T$ is positive semi-definite.

2. Let $A$ be the adjacency matrix of a graph. The Laplacian of $A$ is $L = D - A$ where
$D$ is a diagonal matrix whose diagonal entries are the row sums of $A$. Prove that
$L$ is positive semi-definite by showing that $L = B^T B$ where $B$ is an $m$-by-$n$ matrix
with a row for each edge in the graph, a column for each vertex, and we define

$$b_{ei} = \begin{cases} -1 & \text{if } i \text{ is the endpoint of } e \text{ with lesser index} \\ 1 & \text{if } i \text{ is the endpoint of } e \text{ with greater index} \\ 0 & \text{if } i \text{ is not an endpoint of } e \end{cases}$$


Exercise 3.20 Prove that the eigenvalues of a symmetric real valued matrix are real. 


Exercise 3.21 Suppose $A$ is a square invertible matrix and the SVD of $A$ is $A = \sum_i \sigma_i u_i v_i^T$.
Prove that the inverse of $A$ is $\sum_i \frac{1}{\sigma_i} v_i u_i^T$.


Exercise 3.22 Suppose $A$ is square, but not necessarily invertible and has SVD
$A = \sum_{i=1}^{r} \sigma_i u_i v_i^T$. Let $B = \sum_{i=1}^{r} \frac{1}{\sigma_i} v_i u_i^T$. Show that $BAx = x$ for all $x$ in the span of the
right-singular vectors of $A$. For this reason $B$ is sometimes called the pseudo inverse of $A$ and
can play the role of $A^{-1}$ in many applications.


Exercise 3.23

1. For any matrix $A$, show that $\sigma_k \le \frac{\|A\|_F}{\sqrt{k}}$.

2. Prove that there exists a matrix $B$ of rank at most $k$ such that $\|A - B\|_2 \le \frac{\|A\|_F}{\sqrt{k}}$.

3. Can the 2-norm on the left hand side in (2) be replaced by Frobenius norm?


Exercise 3.24 Suppose an $n \times d$ matrix $A$ is given and you are allowed to preprocess
$A$. Then you are given a number of $d$-dimensional vectors $x_1, x_2, \ldots, x_m$ and for each of
these vectors you must find the vector $Ax_j$ approximately, in the sense that you must find a
vector $y_j$ satisfying $|y_j - Ax_j| \le \varepsilon \|A\|_F |x_j|$. Here $\varepsilon > 0$ is a given error bound. Describe
an algorithm that accomplishes this in time $O\left(\frac{d+n}{\varepsilon^2}\right)$ per $x_j$ not counting the preprocessing
time. Hint: use Exercise 3.23.


Exercise 3.25 Find the values of $c_i$ to maximize $\sum_{i=1}^{r} c_i^2 \sigma_i^2$ where $\sigma_1^2 \ge \sigma_2^2 \ge \cdots$ and
$\sum_{i=1}^{r} c_i^2 = 1$.


Exercise 3.26 (Document-Term Matrices): Suppose we have an m x n document- 
term matrix A where each row corresponds to a document and has been normalized to 
length one. Define the “similarity” between two such documents by their dot product. 


1. Consider a “synthetic” document whose sum of squared similarities with all docu- 
ments in the matrix is as high as possible. What is this synthetic document and how 
would you find it? 


2. How does the synthetic document in (1) differ from the center of gravity? 


3. Building on (1), given a positive integer k, find a set of k synthetic documents such 
that the sum of squares of the mk similarities between each document in the matrix 
and each synthetic document is maximized. To avoid the trivial solution of selecting 
k copies of the document in (1), require the k synthetic documents to be orthogonal 
to each other. Relate these synthetic documents to singular vectors. 


4. Suppose that the documents can be partitioned into $k$ subsets (often called clusters),
where documents in the same cluster are similar and documents in different clusters
are not very similar. Consider the computational problem of isolating the clusters.
This is a hard problem in general. But assume that the terms can also be partitioned
into $k$ clusters so that for $i \ne j$, no term in the $i^{th}$ cluster occurs in a document
in the $j^{th}$ cluster. If we knew the clusters and arranged the rows and columns in
them to be contiguous, then the matrix would be a block-diagonal matrix. Of course
the clusters are not known. By a “block” of the document-term matrix, we mean
a submatrix with rows corresponding to the $i^{th}$ cluster of documents and columns
corresponding to the $i^{th}$ cluster of terms. We can also partition any $n$ vector into
blocks. Show that any right-singular vector of the matrix must have the property
that each of its blocks is a right-singular vector of the corresponding block of the
document-term matrix.


5. Suppose now that the k singular values are all distinct. Show how to solve the 
clustering problem. 


Hint: (4) Use the fact that the right-singular vectors must be eigenvectors of ATA. Show 
that ATA is also block-diagonal and use properties of eigenvectors. 


Exercise 3.27 Let $u$ be a fixed vector. Show that maximizing $x^T u u^T (1 - x)$ subject to
$x_i \in \{0,1\}$ is equivalent to partitioning the coordinates of $u$ into two subsets where the
sums of the elements in the two subsets are as equal as possible.


Exercise 3.28 Read in a photo and convert to a matrix. Perform a singular value de- 
composition of the matrix. Reconstruct the photo using only 1,2,4, and 16 singular values. 


1. Print the reconstructed photo. How good is the quality of the reconstructed photo? 
2. What percent of the Frobenius norm is captured in each case? 


Hint: If you use Matlab, the command to read a photo is imread. The types of files that 
can be read are given by imformats. To print the file use imwrite. Print using jpeg format. 
To access the file afterwards you may need to add the file extension .jpg. The command 
imread will read the file in uint8 and you will need to convert to double for the SVD code. 
Afterwards you will need to convert back to uint8 to write the file. If the photo is a color 
photo you will get three matrices for the three colors used.


Exercise 3.29

1. Create a $100 \times 100$ matrix $A$ of random numbers between 0 and 1 such
that each entry is highly correlated with the adjacent entries. Find the SVD of $A$.
What fraction of the Frobenius norm of $A$ is captured by the top 10 singular vectors?
How many singular vectors are required to capture 95% of the Frobenius norm?

2. Repeat (1) with a $100 \times 100$ matrix of statistically independent random numbers
between 0 and 1.


Exercise 3.30 Show that the maximum cut algorithm in Section
3.9.6 can be carried out in time $O(n^3 + \mathrm{poly}(n)k^k)$, where poly is some polynomial.


Exercise 3.31 Let $x_1, x_2, \ldots, x_n$ be $n$ points in $d$-dimensional space and let $X$ be the
$n \times d$ matrix whose rows are the $n$ points. Suppose we know only the matrix $D$ of pairwise
distances between points and not the coordinates of the points themselves. The set of points
$x_1, x_2, \ldots, x_n$ giving rise to the distance matrix $D$ is not unique since any translation,
rotation, or reflection of the coordinate system leaves the distances invariant. Fix the
origin of the coordinate system so that the centroid of the set of points is at the origin.
That is, $\sum_{i=1}^{n} x_i = 0$.

1. Show that the elements of $XX^T$ are given by

$$x_i x_j^T = -\frac{1}{2}\left[ d_{ij}^2 - \frac{1}{n}\sum_{k=1}^{n} d_{ik}^2 - \frac{1}{n}\sum_{k=1}^{n} d_{kj}^2 + \frac{1}{n^2}\sum_{k=1}^{n}\sum_{l=1}^{n} d_{kl}^2 \right],$$

where $d_{ij}$ denotes the $(i,j)$ entry of $D$.

2. Describe an algorithm for determining the matrix $X$ whose rows are the $x_i$.


Exercise 3.32 


1. Consider the pairwise distance matrix for twenty US cities given below. Use the 
algorithm of Exercise 3.31 to place the cities on a map of the US. The algorithm is 
called classical multidimensional scaling, cmdscale, in Matlab. Alternatively use the 
pairwise distance matrix of 12 Chinese cities to place the cities on a map of China. 


Note: Any rotation or a mirror image of the map will have the same pairwise 
distances. 


2. Suppose you had airline distances for 50 cities around the world. Could you use
these distances to construct a 3-dimensional world model?


Pairwise distances between twenty US cities (columns BOS through MIN):

City               BOS   BUF   CHI   DAL   DEN   HOU    LA   MEM   MIA   MIN
Boston               -   400   851  1551  1769  1605  2596  1137  1255  1123
Buffalo            400     -   454  1198  1370  1286  2198   803  1181   731
Chicago            851   454     -   803   920   940  1745   482  1188   355
Dallas            1551  1198   803     -   663   225  1240   420  1111   862
Denver            1769  1370   920   663     -   879   831   879  1726   700
Houston           1605  1286   940   225   879     -  1374   484   968  1056
Los Angeles       2596  2198  1745  1240   831  1374     -  1603  2339  1524
Memphis           1137   803   482   420   879   484  1603     -   872   699
Miami             1255  1181  1188  1111  1726   968  2339   872     -  1511
Minneapolis       1123   731   355   862   700  1056  1524   699  1511     -
New York           188   292   713  1374  1631  1420  2451   957  1092  1018
Omaha             1282   883   432   586   488   794  1315   529  1397   290
Philadelphia       271   279   666  1299  1579  1341  2394   881  1019   985
Phoenix           2300  1906  1453   887   586  1017   357  1263  1982  1280
Pittsburgh         483   178   410  1070  1320  1137  2136   660  1010   743
Saint Louis       1038   662   262   547   796   679  1589   240  1061   466
Salt Lake City    2099  1699  1260   999   371  1200   579  1250  2089   987
San Francisco     2699  2300  1858  1483   949  1645   347  1802  2594  1584
Seattle           2493  2117  1737  1681  1021  1891   959  1867  2734  1395
Washington D.C.    393   292   597  1185  1494  1220  2300   765   923   934


Pairwise distances between twenty US cities (columns NY through DC):

City                NY   OMA   PHI   PHO   PIT   StL   SLC    SF   SEA    DC
Boston             188  1282   271  2300   483  1038  2099  2699  2493   393
Buffalo            292   883   279  1906   178   662  1699  2300  2117   292
Chicago            713   432   666  1453   410   262  1260  1858  1737   597
Dallas            1374   586  1299   887  1070   547   999  1483  1681  1185
Denver            1631   488  1579   586  1320   796   371   949  1021  1494
Houston           1420   794  1341  1017  1137   679  1200  1645  1891  1220
Los Angeles       2451  1315  2394   357  2136  1589   579   347   959  2300
Memphis            957   529   881  1263   660   240  1250  1802  1867   765
Miami             1092  1397  1019  1982  1010  1061  2089  2594  2734   923
Minneapolis       1018   290   985  1280   743   466   987  1584  1395   934
New York             -  1144    83  2145   317   875  1972  2571  2408   230
Omaha             1144     -  1094  1036   836   354   833  1429  1369  1014
Philadelphia        83  1094     -  2083   259   811  1925  2523  2380   123
Phoenix           2145  1036  2083     -  1828  1272   504   653  1114  1973
Pittsburgh         317   836   259  1828     -   559  1668  2264  2138   192
Saint Louis        875   354   811  1272   559     -  1162  1744  1724   712
Salt Lake City    1972   833  1925   504  1668  1162     -   600   701  1848
San Francisco     2571  1429  2523   653  2264  1744   600     -   678  2442
Seattle           2408  1369  2380  1114  2138  1724   701   678     -  2329
Washington D.C.    230  1014   123  1973   192   712  1848  2442  2329     -

Pairwise distances between twelve Chinese cities:

City         Beijing  Tianjin  Shanghai  Chongqing  Hohhot  Urumqi  Lhasa  Yinchuan  Nanning  Harbin  Changchun  Shenyang
Beijing            0      125      1239       3026     480    3300   3736      1192     2373    1230        979       684
Tianjin          125        0      1150       1954     604    3330   3740      1316     2389    1207        955       661
Shanghai        1239     1150         0       1945    1717    3929   4157      2092     1892    2342       2090      1796
Chongqing       3026     1954      1945          0    1847    3202   2457      1570      993    3156       2905      2610
Hohhot           480      604      1717       1847       0    2825   3260       716     2657    1710       1458      1164
Urumqi          3300     3330      3929       3202    2825       0   2668      2111     4279    4531       4279      3985
Lhasa           3736     3740      4157       2457    3260    2668      0      2547     3431    4967       4715      4421
Yinchuan        1192     1316      2092       1570     716    2111   2547         0     2673    2422       2170      1876
Nanning         2373     2389      1892        993    2657    4279   3431      2673        0    3592       3340      3046
Harbin          1230     1207      2342       3156    1710    4531   4967      2422     3592       0        256       546
Changchun        979      955      2090       2905    1458    4279   4715      2170     3340     256          0       294
Shenyang         684      661      1796       2610    1164    3985   4421      1876     3046     546        294         0


Exercise 3.33 One's data in a high dimensional space may lie on a lower dimensional
sheath. To test for this one might for each data point find the set of closest data points
and calculate the vector distance from the data point to each of the close points. If the set
of these distance vectors spans a space of lower dimension than the number of distance vectors,
then it is likely that the data is on a low dimensional sheath. To test the dimension of
the space of the distance vectors one might use the singular value decomposition to find
the singular values. The dimension of the space is the number of large singular values.
The low singular values correspond to noise or slight curvature of the sheath. To test
this concept generate a data set of points that lie on a one dimensional curve in three
space. For each point find maybe ten nearest points, form the matrix of distances, and do
a singular value decomposition on the matrix. Report what happens.


Use code such as the following to create the data.


function [ data, distance ] = create_sheath( n )
%creates n data points on a one dimensional sheath in three dimensional
%space
%
if nargin==0
    n=100;
end

data=zeros(3,n);
for i=1:n
    x=sin((pi/100)*i);
    y=sqrt(1-x^2);
    z=0.003*i;
    data(:,i)=[x;y;z];
end

%subtract adjacent vertices
distance=zeros(3,10);
for i=1:5
    distance(:,i)=data(:,i)-data(:,6);
    distance(:,i+5)=data(:,i+6)-data(:,6);
end
end


4 Random Walks and Markov Chains 


A random walk on a directed graph consists of a sequence of vertices generated from a 
start vertex by randomly selecting an incident edge, traversing the edge to a new vertex, 
and repeating the process. 


We generally assume the graph is strongly connected, meaning that for any pair of 
vertices x and y, the graph contains a path of directed edges starting at x and ending at 
y. If the graph is strongly connected, then no matter where the walk begins the fraction 
of time the walk spends at the different vertices of the graph converges to a stationary 
probability distribution. 


Start a random walk at a vertex $x$ and think of the starting probability distribution as
putting a mass of one on $x$ and zero on every other vertex. More generally, one could start
with any probability distribution $p$, where $p$ is a row vector with non-negative components
summing to one, with $p_x$ being the probability of starting at vertex $x$. The probability
of being at vertex $x$ at time $t + 1$ is the sum over each adjacent vertex $y$ of the probability
of being at $y$ at time $t$ and taking the transition from $y$ to $x$. Let $p(t)$ be a row vector with a
component for each vertex specifying the probability mass of the vertex at time $t$ and let
$p(t + 1)$ be the row vector of probabilities at time $t + 1$. In matrix notation¹


$p(t)P = p(t + 1)$


where the $ij^{th}$ entry of the matrix $P$ is the probability of the walk at vertex $i$ selecting
the edge to vertex $j$.
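The update $p(t)P = p(t+1)$ can be illustrated on a small example. The three-vertex graph and its transition matrix below are hypothetical, chosen only so the arithmetic is easy to follow; `step` is a helper name introduced for this sketch.

```python
def step(p, P):
    """One step of the walk: return the row vector p P."""
    n = len(p)
    return [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]


# Hypothetical strongly connected directed graph on vertices 0, 1, 2.
# Row i lists the probabilities of moving from vertex i to each vertex j.
P = [
    [0.0, 1.0, 0.0],   # from vertex 0 the walk always moves to vertex 1
    [0.5, 0.0, 0.5],   # from vertex 1 it moves to 0 or 2 with equal probability
    [1.0, 0.0, 0.0],   # from vertex 2 it always moves to vertex 0
]

p0 = [1.0, 0.0, 0.0]   # all probability mass on the start vertex 0
p1 = step(p0, P)       # distribution after one step
p2 = step(p1, P)       # distribution after two steps
```

Each row of `P` sums to one, so every `step` maps a probability vector to a probability vector.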


A fundamental property of a random walk is that in the limit, the long-term average 
probability of being at a particular vertex is independent of the start vertex, or an initial 
probability distribution over vertices, provided only that the underlying graph is strongly 
connected. The limiting probabilities are called the stationary probabilities. This funda- 
mental theorem is proved in the next section. 


A special case of random walks, namely random walks on undirected graphs, has 
important connections to electrical networks. Here, each edge has a parameter called 
conductance, like electrical conductance. If the walk is at vertex x, it chooses an edge to 
traverse next from among all edges incident to x with probability proportional to its con- 
ductance. Certain basic quantities associated with random walks are hitting time, which 
is the expected time to reach vertex y starting at vertex x, and cover time, which is the 
expected time to visit every vertex. Qualitatively, for undirected graphs these quantities 
are all bounded above by polynomials in the number of vertices. The proofs of these facts 
will rely on the analogy between random walks and electrical networks. 





¹Probability vectors are represented by row vectors to simplify notation in equations like the one here.


random walk                           Markov chain
graph                                 stochastic process
vertex                                state
strongly connected                    persistent
aperiodic                             aperiodic
strongly connected and aperiodic      ergodic
edge weighted undirected graph        time reversible


Table 5.1: Correspondence between terminology of random walks and Markov chains


Aspects of the theory of random walks were developed in computer science with a
number of applications including defining the pagerank of pages on the World Wide
Web by their stationary probability. An equivalent concept called a Markov chain had
previously been developed in the statistical literature. A Markov chain has a finite set of
states. For each pair of states $x$ and $y$, there is a transition probability $p_{xy}$ of going from
state $x$ to state $y$ where for each $x$, $\sum_y p_{xy} = 1$. A random walk in the Markov chain
starts at some state. At a given time step, if it is in state $x$, the next state $y$ is selected
randomly with probability $p_{xy}$. A Markov chain can be represented by a directed graph
with a vertex representing each state and a directed edge with weight $p_{xy}$ from vertex $x$
to vertex $y$. We say that the Markov chain is connected if the underlying directed graph
is strongly connected, that is, if there is a directed path from every vertex to every other
vertex. The matrix $P$ consisting of the $p_{xy}$ is called the transition probability matrix of
the chain. The terms “random walk” and “Markov chain” are used interchangeably. The
correspondence between the terminologies of random walks and Markov chains is given
in Table 5.1.


A state of a Markov chain is persistent if it has the property that should the state ever 
be reached, the random process will return to it with probability one. This is equivalent 
to the property that the state is in a strongly connected component with no out edges. 
For most of the chapter, we assume that the underlying directed graph is strongly con- 
nected. We discuss here briefly what might happen if we do not have strong connectivity. 
Consider the directed graph in Figure 4.1b with three strongly connected components, A, 
B, and C. Starting from any vertex in A, there is a non-zero probability of eventually 
reaching any vertex in A. However, the probability of returning to a vertex in A is less 
than one and thus vertices in A, and similarly vertices in B, are not persistent. From 
any vertex in C, the walk eventually will return with probability one to the vertex, since 
there is no way of leaving component C. Thus, vertices in C are persistent. 


Figure 4.1: (a) A directed graph with some vertices having no out edges and a
strongly connected component A with no in edges.
(b) A directed graph with three strongly connected components.


A connected Markov Chain is said to be aperiodic if the greatest common divisor of
the lengths of directed cycles is one. It is known that for connected aperiodic chains, the
probability distribution of the random walk converges to a unique stationary distribution.
Aperiodicity is a technical condition needed in this proof. Here, we do not prove this
theorem and do not worry about aperiodicity at all. It turns out that if we take the
average probability distribution of the random walk over the first $t$ steps, then this average
converges to a limiting distribution for connected chains (without assuming aperiodicity)
and this average is what one uses in practice. We prove this limit theorem and explain
its uses in what is called the Markov Chain Monte Carlo (MCMC) method.
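A two-vertex cycle shows why the average is used when aperiodicity is not assumed. In this sketch the chain, the starting distribution, and the helper names `step` and `running_average` are illustrative assumptions: the distribution itself alternates forever between the two vertices, while the average of the first $t$ distributions settles at $(1/2, 1/2)$.

```python
def step(p, P):
    """One step of the walk: return the row vector p P."""
    n = len(p)
    return [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]


def running_average(p0, P, t):
    """Return (average of p(0), ..., p(t-1), and the distribution p(t))."""
    total = [0.0] * len(p0)
    p = list(p0)
    for _ in range(t):
        total = [s + x for s, x in zip(total, p)]
        p = step(p, P)
    return [s / t for s in total], p


# Periodic chain: the walk deterministically alternates between two vertices.
P_cycle = [[0.0, 1.0],
           [1.0, 0.0]]

avg, p_t = running_average([1.0, 0.0], P_cycle, 1000)
# p_t keeps flipping between [1, 0] and [0, 1]; avg is the long-term average.
```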


Markov chains are used to model situations where all the information of the system 
necessary to predict the future can be encoded in the current state. A typical example 
is speech, where for a small k the current state encodes the last k syllables uttered by 
the speaker. Given the current state, there is a certain probability of each syllable being 
uttered next and these can be used to calculate the transition probabilities. Another 
example is a gambler’s assets, which can be modeled as a Markov chain where the current 
state is the amount of money the gambler has on hand. The model would only be valid 
if the gambler’s bets depend only on current assets, not the past history. 


Later in the chapter, we study the widely used Markov Chain Monte Carlo method
(MCMC). Here, the objective is to sample a large space according to some probability
distribution $p$. The number of elements in the space may be very large, say $10^{100}$. One
designs a Markov chain where states correspond to the elements of the space. The
transition probabilities of the chain are designed so that the stationary probability of the chain
is the probability distribution $p$ with which we want to sample. One chooses samples by
taking a random walk until the probability distribution is close to the stationary
distribution of the chain and then selects the current state of the walk. Then the walk continues
a number of steps until the probability distribution is nearly independent of where the
walk was when the first element was selected. A second point is then selected, and so on.
Although it is impossible to store the graph in a computer since it has $10^{100}$ vertices, to do
the walk one needs only store the current vertex of the walk and be able to generate the
adjacent vertices by some algorithm. What is critical is that the probability distribution
of the walk converges to the stationary distribution in time logarithmic in the number of
states.


We mention two motivating examples. The first is to select a point at random in 
d-space according to a probability density such as a Gaussian. Put down a grid and let 
each grid point be a state of the Markov chain. Given a probability density p, design 
transition probabilities of a Markov chain so that the stationary distribution is p. In 
general, the number of states grows exponentially in the dimension d, but if the time 
to converge to the stationary distribution grows polynomially in d, then one can do a 
random walk on the graph until convergence to the stationary probability. Once the sta- 
tionary probability has been reached, one selects a point. To select a set of points, one 
must walk a number of steps between each selection so that the probability of the current 
point is independent of the previous point. By selecting a number of points one can es- 
timate the probability of a region by observing the number of selected points in the region. 


A second example is from physics. Consider an $n \times n$ grid in the plane with a particle
at each grid point. Each particle has a spin of $\pm 1$. A configuration is an $n^2$-dimensional
vector $v = (v_1, v_2, \ldots, v_{n^2})$, where $v_i$ is the spin of the $i^{th}$ particle. There are $2^{n^2}$ spin
configurations. The energy of a configuration is a function $f(v)$ of the configuration, not of
any single spin. A central problem in statistical mechanics is to sample spin configurations
according to their probability. It is easy to design a Markov chain with one state per spin
configuration so that the stationary probability of a state is proportional to the state's
energy. If a random walk gets close to the stationary probability in time polynomial in $n$
rather than $2^{n^2}$, then one can sample spin configurations according to their probability.


The Markov Chain has $2^{n^2}$ states, one per configuration. Two states in the Markov
chain are adjacent if and only if the corresponding configurations v and u differ in just one
coordinate ($u_i = v_i$ for all but one i). The Metropolis-Hastings random walk, described
in more detail in Section 4.2, has a transition probability from a configuration v to an
adjacent configuration u of

$$\frac{1}{n^2} \min\left(1, \frac{f(u)}{f(v)}\right).$$

As we will see, the Markov Chain has a stationary probability proportional to the energy.
There are two more crucial facts about this chain. The first is that to execute a step in
the chain, we do not need the whole chain, just the ratio $f(u)/f(v)$. The second is that
under suitable assumptions, the chain approaches stationarity in time polynomial in n.


A quantity called the mixing time, loosely defined as the time needed to get close to
the stationary distribution, is often much smaller than the number of states. In Section 
4.4, we relate the mixing time to a combinatorial notion called normalized conductance 
and derive upper bounds on the mixing time in several cases. 


4.1 Stationary Distribution 


Let p(t) be the probability distribution after t steps of a random walk. Define the
long-term average probability distribution a(t) by

$$a(t) = \frac{1}{t}\bigl(p(0) + p(1) + \cdots + p(t-1)\bigr).$$


The fundamental theorem of Markov chains asserts that for a connected Markov chain, 
a(t) converges to a limit probability vector x that satisfies the equations xP = x. Before 
proving the fundamental theorem of Markov chains, we first prove a technical lemma. 


Lemma 4.1 Let P be the transition probability matrix for a connected Markov chain.
The n × (n+1) matrix $A = [P - I, \mathbf{1}]$ obtained by augmenting the matrix $P - I$ with an
additional column of ones has rank n.


Proof: If the rank of $A = [P - I, \mathbf{1}]$ was less than n there would be a subspace of
solutions to $Ax = 0$ of at least two dimensions. Each row in P sums to one, so each row in
$P - I$ sums to zero. Thus $x = (\mathbf{1}, 0)$, where all but the last coordinate of x is 1, is one
solution to $Ax = 0$. Assume there was a second solution $(x, \alpha)$ perpendicular to $(\mathbf{1}, 0)$.
Then $(P - I)x + \alpha\mathbf{1} = 0$ and for each i, $x_i = \sum_j p_{ij} x_j + \alpha$. Each $x_i$ is a convex
combination of some $x_j$ plus $\alpha$. Let S be the set of i for which $x_i$ attains its maximum
value. Since x is perpendicular to $\mathbf{1}$, some $x_i$ is negative and thus $\bar{S}$ is not empty.
Connectedness implies that some $x_k$ of maximum value is adjacent to some $x_l$ of lower value.
Thus, $x_k > \sum_j p_{kj} x_j$. Therefore $\alpha$ must be greater than 0 in $x_k = \sum_j p_{kj} x_j + \alpha$.

Using the same argument with T the set of i for which $x_i$ takes its minimum value
implies $\alpha < 0$. This contradiction falsifies the assumption of a second solution, thereby
proving the lemma. ∎


Theorem 4.2 (Fundamental Theorem of Markov Chains) For a connected Markov
chain there is a unique probability vector $\pi$ satisfying $\pi P = \pi$. Moreover, for any starting
distribution, $\lim_{t \to \infty} a(t)$ exists and equals $\pi$.


Proof: Note that a(t) is itself a probability vector, since its components are non-negative
and sum to 1. Run one step of the Markov chain starting with distribution a(t); the
distribution after the step is a(t)P. Calculate the change in probabilities due to this step.

$$\begin{aligned}
a(t)P - a(t) &= \frac{1}{t}\left[p(0)P + p(1)P + \cdots + p(t-1)P\right] - \frac{1}{t}\left[p(0) + p(1) + \cdots + p(t-1)\right] \\
&= \frac{1}{t}\left[p(1) + p(2) + \cdots + p(t)\right] - \frac{1}{t}\left[p(0) + p(1) + \cdots + p(t-1)\right] \\
&= \frac{1}{t}\left(p(t) - p(0)\right).
\end{aligned}$$

Thus, $b(t) = a(t)P - a(t)$ satisfies $|b(t)| \le \frac{2}{t} \to 0$, as $t \to \infty$.

By Lemma 4.1 above, $A = [P - I, \mathbf{1}]$ has rank n. The n × n submatrix B of A
consisting of all its columns except the first is invertible. Let c(t) be obtained from
b(t) by removing the first entry. Since $a(t)P - a(t) = b(t)$ and B is obtained by
deleting the first column of $P - I$ and adding a column of 1's, $a(t)B = [c(t), 1]$. Then
$a(t) = [c(t), 1]B^{-1} \to [\mathbf{0}, 1]B^{-1}$, establishing the theorem with $\pi = [\mathbf{0}, 1]B^{-1}$. ∎
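
The role of averaging in Theorem 4.2 can be seen on a small example. The sketch below is only an illustration: the two-state chain and the helper names `step` and `running_average` are assumptions introduced here, not part of the text. The chain alternates deterministically between its two states, so p(t) itself never converges, yet a(t) approaches the stationary vector (1/2, 1/2) and b(t) = a(t)P - a(t) has 1-norm at most 2/t.

```python
def step(p, P):
    """One step of the chain: return the row vector p P."""
    n = len(P)
    return [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]


def running_average(p0, P, t):
    """Return a(t) = (p(0) + p(1) + ... + p(t-1)) / t."""
    total = [0.0] * len(p0)
    p = list(p0)
    for _ in range(t):
        total = [s + x for s, x in zip(total, p)]
        p = step(p, P)
    return [s / t for s in total]


# Hypothetical connected but periodic chain on two states.
P = [[0.0, 1.0],
     [1.0, 0.0]]
start = [1.0, 0.0]

a = running_average(start, P, 1000)
b = [x - y for x, y in zip(step(a, P), a)]   # b(t) = a(t)P - a(t)
print(a, sum(abs(x) for x in b))
```

Starting from a single state keeps p(t) oscillating forever, which is why the theorem is stated for the running average rather than for p(t).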


We finish this section with the following lemma useful in establishing that a probability 
distribution is the stationary probability distribution for a random walk on a connected 
graph with edge probabilities. 


Lemma 4.3 For a random walk on a strongly connected graph with probabilities on the
edges, if the vector $\pi$ satisfies $\pi_x p_{xy} = \pi_y p_{yx}$ for all x and y and $\sum_x \pi_x = 1$, then $\pi$ is
the stationary distribution of the walk.

Proof: Since $\pi$ satisfies $\pi_x p_{xy} = \pi_y p_{yx}$, summing both sides over y, $\pi_x = \sum_y \pi_y p_{yx}$ and hence $\pi$
satisfies $\pi = \pi P$. By Theorem 4.2, $\pi$ is the unique stationary probability. ∎


4.2 Markov Chain Monte Carlo 


The Markov Chain Monte Carlo (MCMC) method is a technique for sampling a
multivariate probability distribution p(x), where $x = (x_1, x_2, \ldots, x_d)$. The MCMC method is
used to estimate the expected value of a function f(x)

$$E(f) = \sum_x f(x) p(x).$$

If each $x_i$ can take on two or more values, then there are at least $2^d$ values for x, so an
explicit summation requires exponential time. Instead, one could draw a set of samples,
where each sample x is selected with probability p(x). Averaging f over these samples
provides an estimate of the sum.


To sample according to p(x), design a Markov Chain whose states correspond to the
possible values of x and whose stationary probability distribution is p(x). There are two
general techniques to design such a Markov Chain: the Metropolis-Hastings algorithm
and Gibbs sampling, which we will describe in the next two subsections. The Fundamental
Theorem of Markov Chains, Theorem 4.2, states that the average of the function f
over states seen in a sufficiently long run is a good estimate of E(f). The harder task
is to show that the number of steps needed before the long-run average probabilities are
close to the stationary distribution grows polynomially in d, though the total number of
states may grow exponentially in d. This phenomenon, known as rapid mixing, happens for
a number of interesting examples. Section 4.4 presents a crucial tool used to show rapid
mixing.


We used $x \in \mathbb{R}^d$ to emphasize that distributions are multi-variate. From a Markov
chain perspective, each value x can take on is a state, i.e., a vertex of the graph on which
the random walk takes place. Henceforth, we will use the subscripts i, j, k, ... to denote
states and will use $p_i$ instead of $p(x_1, x_2, \ldots, x_d)$ to denote the probability of the state
corresponding to a given set of values for the variables. Recall that in the Markov chain
terminology, vertices of the graph are called states.


Recall the notation that p(t) is the row vector of probabilities of the random walk
being at each state (vertex of the graph) at time t. So, p(t) has as many components
as there are states and its $i^{\text{th}}$ component is the probability of being in state i at time t.
Recall the long-term t-step average is

$$a(t) = \frac{1}{t}\left[p(0) + p(1) + \cdots + p(t-1)\right]. \qquad (4.1)$$


The expected value of the function f under the probability distribution p is
$E(f) = \sum_i f_i p_i$ where $f_i$ is the value of f at state i. Our estimate of this quantity will be the
average value of f at the states seen in a t step walk. Call this estimate $\gamma$. Clearly, the
expected value of $\gamma$ is

$$E(\gamma) = \sum_i f_i \left( \frac{1}{t} \sum_{j=0}^{t-1} \text{Prob}(\text{walk is in state } i \text{ at time } j) \right) = \sum_i f_i a_i(t).$$

The expectation here is with respect to the “coin tosses” of the algorithm, not with respect
to the underlying distribution p. Let $f_{\max}$ denote the maximum absolute value of f. It is
easy to see that

$$\left| \sum_i f_i p_i - E(\gamma) \right| \le f_{\max} \sum_i |p_i - a_i(t)| = f_{\max} \|p - a(t)\|_1 \qquad (4.2)$$

where the quantity $\|p - a(t)\|_1$ is the $\ell_1$ distance between the probability distributions p
and a(t), often called the “total variation distance” between the distributions (under the
usual convention, the total variation distance is half of this quantity). We will
build tools to upper bound $\|p - a(t)\|_1$. Since p is the stationary distribution, the t for
which $\|p - a(t)\|_1$ becomes small is determined by the rate of convergence of the Markov
chain to its steady state.


The following proposition is often useful.

Proposition 4.4 For two probability distributions p and q,

$$\|p - q\|_1 = 2 \sum_i (p_i - q_i)^+ = 2 \sum_i (q_i - p_i)^+$$

where $x^+ = x$ if $x \ge 0$ and $x^+ = 0$ if $x < 0$.


The proof is left as an exercise. 
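
A quick numerical check of Proposition 4.4 is sketched below. It is not a proof; the two distributions and the helper names `l1_distance` and `positive_part_sum` are hypothetical choices made for illustration. Exact rational arithmetic is used so that the three quantities can be compared for equality.

```python
from fractions import Fraction as F


def l1_distance(p, q):
    """Sum over i of |p_i - q_i|."""
    return sum(abs(a - b) for a, b in zip(p, q))


def positive_part_sum(p, q):
    """Sum over i of (p_i - q_i)^+."""
    return sum(max(a - b, 0) for a, b in zip(p, q))


# Two hypothetical distributions on three states.
p = [F(1, 2), F(1, 3), F(1, 6)]
q = [F(1, 4), F(1, 4), F(1, 2)]

print(l1_distance(p, q), 2 * positive_part_sum(p, q), 2 * positive_part_sum(q, p))
```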


4.2.1 Metropolis-Hastings Algorithm


The Metropolis-Hastings algorithm is a general method to design a Markov chain whose
stationary distribution is a given target distribution p. Start with a connected undirected
graph G on the set of states. If the states are the lattice points $(x_1, x_2, \ldots, x_d)$ in $\mathbb{R}^d$
with $x_i \in \{0, 1, 2, \ldots, n\}$, then G could be the lattice graph with 2d coordinate edges at
each interior vertex. In general, let r be the maximum degree of any vertex of G. The
transitions of the Markov chain are defined as follows. At state i select neighbor j with
probability $\frac{1}{r}$. Since the degree of i may be less than r, with some probability no edge
is selected and the walk remains at i. If a neighbor j is selected and $p_j \ge p_i$, go to j. If
$p_j < p_i$, go to j with probability $p_j/p_i$ and stay at i with probability $1 - \frac{p_j}{p_i}$. Intuitively,
this favors “heavier” states with higher $p_i$ values. For i adjacent to j in G,

$$p_{ij} = \frac{1}{r} \min\left(1, \frac{p_j}{p_i}\right)$$

and

$$p_{ii} = 1 - \sum_{j \ne i} p_{ij}.$$

Thus,

$$p_i p_{ij} = \frac{p_i}{r} \min\left(1, \frac{p_j}{p_i}\right) = \frac{1}{r} \min(p_i, p_j) = \frac{p_j}{r} \min\left(1, \frac{p_i}{p_j}\right) = p_j p_{ji}.$$

By Lemma 4.3, the stationary probabilities are indeed $p_i$ as desired.
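
The construction above can be sketched in a few lines. The code below is a minimal illustration under stated assumptions: the four-state target distribution, the edge set, and the function name `metropolis_hastings_matrix` are hypothetical choices, not taken from the text. It builds the transition probabilities $p_{ij} = \frac{1}{r}\min(1, p_j/p_i)$ with exact fractions and checks the detailed balance condition of Lemma 4.3.

```python
from fractions import Fraction as F


def metropolis_hastings_matrix(p, neighbors):
    """Build P with p_ij = (1/r) min(1, p_j / p_i) for adjacent i, j."""
    r = max(len(nbrs) for nbrs in neighbors.values())   # maximum degree
    P = {i: {j: F(0) for j in p} for i in p}
    for i, nbrs in neighbors.items():
        for j in nbrs:
            P[i][j] = F(1, r) * min(F(1), p[j] / p[i])
        P[i][i] = 1 - sum(P[i][j] for j in nbrs)         # stay-put probability
    return P


# Hypothetical target distribution and graph (edge set chosen for illustration).
p = {"a": F(1, 2), "b": F(1, 4), "c": F(1, 8), "d": F(1, 8)}
neighbors = {"a": ["b", "c", "d"], "b": ["a", "c"],
             "c": ["a", "b", "d"], "d": ["a", "c"]}

P = metropolis_hastings_matrix(p, neighbors)

# Detailed balance p_i p_ij = p_j p_ji, hence p P = p.
balanced = all(p[i] * P[i][j] == p[j] * P[j][i] for i in p for j in p)
after_step = {j: sum(p[i] * P[i][j] for i in p) for j in p}
print(balanced, after_step == p)
```

Note that only ratios $p_j/p_i$ enter the transition probabilities, so the target distribution needs to be known only up to a normalizing constant.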


Example: Consider the graph in Figure 4.2. Using the Metropolis-Hastings algorithm,
assign transition probabilities so that the stationary probability of a random walk is
$p(a) = \frac{1}{2}$, $p(b) = \frac{1}{4}$, $p(c) = \frac{1}{8}$, and $p(d) = \frac{1}{8}$. The maximum degree of any vertex is three,
so at a, the probability of taking the edge (a, b) is $\frac{1}{3} \cdot \frac{1}{4} \cdot \frac{2}{1}$ or $\frac{1}{6}$. The probability of taking the
edge (a, c) is $\frac{1}{3} \cdot \frac{1}{8} \cdot \frac{2}{1}$ or $\frac{1}{12}$ and of taking the edge (a, d) is $\frac{1}{3} \cdot \frac{1}{8} \cdot \frac{2}{1}$ or $\frac{1}{12}$. Thus, the probability
of staying at a is $\frac{2}{3}$. The probability of taking the edge from b to a is $\frac{1}{3}$. The probability
of taking the edge from c to a is $\frac{1}{3}$ and the probability of taking the edge from d to a is
$\frac{1}{3}$. Thus, the stationary probability of a is $\frac{1}{4} \cdot \frac{1}{3} + \frac{1}{8} \cdot \frac{1}{3} + \frac{1}{8} \cdot \frac{1}{3} + \frac{1}{2} \cdot \frac{2}{3} = \frac{1}{2}$, which is the desired
probability. ∎

Figure 4.2: Using the Metropolis-Hastings algorithm to set probabilities for a random
walk so that the stationary probability will be the desired probability.


4.2.2 Gibbs Sampling 


Gibbs sampling is another Markov Chain Monte Carlo method to sample from a
multivariate probability distribution. Let p(x) be the target distribution where
$x = (x_1, \ldots, x_d)$. Gibbs sampling consists of a random walk on an undirected graph whose
vertices correspond to the values of $x = (x_1, \ldots, x_d)$ and in which there is an edge from
x to y if x and y differ in only one coordinate. Thus, the underlying graph is like a
d-dimensional lattice except that the vertices in the same coordinate line form a clique.


To generate samples of $x = (x_1, \ldots, x_d)$ with a target distribution p(x), the Gibbs
sampling algorithm repeats the following steps. One of the variables $x_i$ is chosen to be
updated. Its new value is chosen based on the conditional probability of $x_i$ given the
values of the other variables. There are two commonly used schemes to determine which $x_i$
to update. One scheme is to choose $x_i$ randomly, the other is to choose $x_i$ by sequentially
scanning from $x_1$ to $x_d$.

Figure 4.3: Using the Gibbs algorithm to set probabilities for a random walk so that
the stationary probability will be a desired probability. (The figure gives a target
distribution p(i, j) on a 3 × 3 grid of states; the calculation of the edge probability
$p_{(11)(12)} = \frac{1}{d}\, p_{12}/(p_{11} + p_{12} + p_{13})$; the resulting edge probabilities; and a verification of
$p_i p_{ij} = p_j p_{ji}$ for a few edges. It notes that the edge probabilities out of a state such as
(1,1) do not add up to one. That is, with some probability the walk stays at the state it is in.)


Suppose that x and y are two states that differ in only one coordinate. Without loss
of generality let that coordinate be the first. Then, in the scheme where a coordinate is
randomly chosen to modify, the probability $p_{xy}$ of going from x to y is

$$p_{xy} = \frac{1}{d}\, p(y_1 \mid x_2, x_3, \ldots, x_d).$$

Similarly,

$$p_{yx} = \frac{1}{d}\, p(x_1 \mid y_2, y_3, \ldots, y_d) = \frac{1}{d}\, p(x_1 \mid x_2, x_3, \ldots, x_d).$$

Here use was made of the fact that for $j \ne 1$, $x_j = y_j$.

It is simple to see that this chain has stationary probability proportional to p(x).
Rewrite $p_{xy}$ as

$$\begin{aligned}
p_{xy} &= \frac{1}{d}\, \frac{p(y_1 \mid x_2, x_3, \ldots, x_d)\, p(x_2, x_3, \ldots, x_d)}{p(x_2, x_3, \ldots, x_d)} \\
&= \frac{1}{d}\, \frac{p(y_1, x_2, x_3, \ldots, x_d)}{p(x_2, x_3, \ldots, x_d)} \\
&= \frac{1}{d}\, \frac{p(y)}{p(x_2, x_3, \ldots, x_d)}
\end{aligned}$$

again using $x_j = y_j$ for $j \ne 1$. Similarly write

$$p_{yx} = \frac{1}{d}\, \frac{p(x)}{p(x_2, x_3, \ldots, x_d)}$$

from which it follows that $p(x) p_{xy} = p(y) p_{yx}$. By Lemma 4.3 the stationary probability
of the random walk is p(x).
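
The derivation can be checked on a small case. The following sketch assumes a hypothetical joint distribution on two variables and a helper `gibbs_transition` introduced only for illustration; it computes the random-coordinate Gibbs transition probabilities with exact fractions and verifies $p(x) p_{xy} = p(y) p_{yx}$ for every pair of distinct states.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical joint distribution p(x1, x2) on {0, 1} x {0, 1, 2}.
p = {(0, 0): F(1, 4), (0, 1): F(1, 8), (0, 2): F(1, 8),
     (1, 0): F(1, 12), (1, 1): F(1, 4), (1, 2): F(1, 6)}
values = [(0, 1), (0, 1, 2)]      # allowed values of each coordinate
d = len(values)


def gibbs_transition(x, y):
    """p_xy for states x != y under random-coordinate Gibbs sampling."""
    differing = [i for i in range(d) if x[i] != y[i]]
    if len(differing) != 1:
        return F(0)               # no edge unless exactly one coordinate differs
    i = differing[0]
    # Total probability of the states that agree with x outside coordinate i.
    line = sum(p[x[:i] + (v,) + x[i + 1:]] for v in values[i])
    return F(1, d) * p[y] / line


states = list(product(*values))
balanced = all(p[x] * gibbs_transition(x, y) == p[y] * gibbs_transition(y, x)
               for x in states for y in states if x != y)
print(balanced)
```

The quantity `line` plays the role of $p(x_2, x_3, \ldots, x_d)$ in the derivation: it is the same for x and y, which is why detailed balance holds.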


4.3 Areas and Volumes 


Computing areas and volumes is a classical problem. For many regular figures in 
two and three dimensions there are closed form formulae. In Chapter 2, we saw how to 
compute volume of a high dimensional sphere by integration. For general convex sets in 
d-space, there are no closed form formulae. Can we estimate volumes of d-dimensional 
convex sets in time that grows as a polynomial function of d? The MCMC method answers
this question in the affirmative. 
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One way to estimate the area of the region is to enclose it in a rectangle and estimate 
the ratio of the area of the region to the area of the rectangle by picking random points 
in the rectangle and seeing what proportion land in the region. Such methods fail in high 
dimensions. Even for a sphere in high dimension, a cube enclosing the sphere has
exponentially larger volume, so exponentially many samples are required to estimate the volume
of the sphere.
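
To see how quickly naive rejection sampling degrades, the sketch below computes the fraction of the cube $[-1, 1]^d$ occupied by the unit ball, using the standard formula $\pi^{d/2}/\Gamma(d/2 + 1)$ for the volume of the unit ball. The function name `ball_to_cube_ratio` and the chosen dimensions are illustrative assumptions.

```python
import math


def ball_to_cube_ratio(d):
    """Volume of the unit ball in R^d divided by the volume of [-1, 1]^d."""
    ball = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    cube = 2.0 ** d
    return ball / cube


for d in (2, 3, 10, 20, 50):
    print(d, ball_to_cube_ratio(d))
```

The reciprocal of this ratio is roughly the number of uniform samples from the cube needed before one lands in the ball.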


It turns out, however, that the problem of estimating volumes of sets can be reduced
to the problem of drawing uniform random samples from sets. Suppose one wants to
estimate the volume of a convex set R. Create a concentric series of larger and larger
spheres¹⁵ $S_1, S_2, \ldots, S_k$ such that $S_1$ is contained in R and $S_k$ contains R. Then

$$\text{Vol}(R) = \text{Vol}(S_k \cap R) = \frac{\text{Vol}(S_k \cap R)}{\text{Vol}(S_{k-1} \cap R)} \cdot \frac{\text{Vol}(S_{k-1} \cap R)}{\text{Vol}(S_{k-2} \cap R)} \cdots \frac{\text{Vol}(S_2 \cap R)}{\text{Vol}(S_1 \cap R)} \cdot \text{Vol}(S_1).$$

If the radius of the sphere $S_i$ is $1 + \frac{1}{d}$ times the radius of the sphere $S_{i-1}$, then we have:

$$1 \le \frac{\text{Vol}(S_i \cap R)}{\text{Vol}(S_{i-1} \cap R)} \le e$$

because $\text{Vol}(S_i)/\text{Vol}(S_{i-1}) = \left(1 + \frac{1}{d}\right)^d < e$, and the fraction of $S_i$ occupied by R is less
than or equal to the fraction of $S_{i-1}$ occupied by R (due to the convexity of R and the
fact that the center of the spheres lies in R). This implies that the ratio
$\frac{\text{Vol}(S_i \cap R)}{\text{Vol}(S_{i-1} \cap R)}$ can
be estimated by rejection sampling, i.e., selecting points in $S_i \cap R$ uniformly at random
and computing the fraction in $S_{i-1} \cap R$, provided one can select points at random from a
d-dimensional convex region.

Solving $\left(1 + \frac{1}{d}\right)^k = r$ for k where r is the ratio of the radius of $S_k$ to the radius of $S_1$
bounds the number of spheres.

$$k = O\left(\log_{1 + (1/d)} r\right) = O(d \ln r).^{16}$$

This means that it suffices to estimate each ratio to a factor of $\left(1 \pm \frac{\varepsilon}{ek}\right)$ in order to
estimate the overall volume to error $1 \pm \varepsilon$.
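
The growth of the number of spheres can be tabulated directly. The sketch below is an illustration only; the radius ratio r = 1000 and the function name `number_of_spheres` are assumptions. It computes the smallest k with $(1 + 1/d)^k \ge r$ and compares it with $d \ln r$.

```python
import math


def number_of_spheres(d, r):
    """Smallest k with (1 + 1/d)^k >= r."""
    return math.ceil(math.log(r) / math.log(1 + 1 / d))


for d in (2, 10, 100):
    r = 1000.0                            # hypothetical radius ratio
    k = number_of_spheres(d, r)
    print(d, k, k / (d * math.log(r)))    # last column stays bounded
```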


It remains to show how to draw a uniform random sample from a d-dimensional convex
set. Here we will use the convexity of the set R and thus the sets $S_i \cap R$ so that the Markov
chain technique will converge quickly to its stationary probability. To select a random
sample from a d-dimensional convex set, impose a grid on the region and do a random
walk on the grid points. At each time, pick one of the 2d coordinate neighbors of the
current grid point, each with probability 1/(2d) and go to the neighbor if it is still in the
set; otherwise, stay put and repeat. If the grid length in each of the d coordinate directions
is at most some a, the total number of grid points in the set is at most $a^d$. Although this
is exponential in d, the Markov chain turns out to be rapidly mixing (the proof is beyond
our scope here) and leads to a polynomial time bounded algorithm to estimate the volume
of any convex set in $\mathbb{R}^d$.

¹⁵ One could also use rectangles instead of spheres.
¹⁶ Using $\log_a r = \frac{\log_b r}{\log_b a}$ and $\ln(1 + x) \le x$.

Figure 4.4: By sampling the area inside the dark line and determining the fraction of
points in the shaded region we compute $\frac{\text{Vol}(S_{i+1} \cap R)}{\text{Vol}(S_i \cap R)}$.
To sample we create a grid and assign a probability of one to each grid point inside the
dark lines and zero outside. Using Metropolis-Hastings edge probabilities the stationary
probability will be uniform for each point inside the region and we can sample points
uniformly and determine the fraction within the shaded region.
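
The grid walk described above can be sketched as follows. This is a toy illustration, not the polynomial-time algorithm itself: the convex set (integer points of a disc of radius 10), the number of steps per sample, and the names `grid_walk` and `in_disc` are all assumptions, and the step count is an arbitrary choice rather than a mixing-time bound.

```python
import random


def grid_walk(inside, start, steps, rng):
    """Walk on integer grid points: move to a random coordinate neighbor
    if it lies in the set, otherwise stay put."""
    x = list(start)
    d = len(x)
    for _ in range(steps):
        i = rng.randrange(d)              # pick one of the d coordinates
        delta = rng.choice((-1, 1))       # and one of its two directions
        x[i] += delta
        if not inside(x):
            x[i] -= delta                 # rejected move: remain at x
    return tuple(x)


# Hypothetical convex set: grid points of a disc of radius 10 in the plane.
def in_disc(x):
    return x[0] ** 2 + x[1] ** 2 <= 100


rng = random.Random(0)
samples = [grid_walk(in_disc, (0, 0), 2000, rng) for _ in range(200)]
inner = sum(1 for s in samples if s[0] ** 2 + s[1] ** 2 <= 25)
print(inner / len(samples))   # rough estimate of the fraction of points in the inner disc
```

Choosing a coordinate uniformly and then a direction uniformly selects each of the 2d coordinate neighbors with probability 1/(2d), as in the text. Only the membership test `inside` is needed, so the set of grid points is never stored.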


4.4 Convergence of Random Walks on Undirected Graphs 


In an undirected graph where $\pi_x p_{xy} = \pi_y p_{yx}$, edges can be assigned weights such
that $p_{xy} = \frac{w_{xy}}{\sum_y w_{xy}}$. See Exercise 4.19. Thus the Metropolis-Hastings algorithm and Gibbs
sampling both involve random walks on edge-weighted undirected graphs. Given an edge-
weighted undirected graph, let $w_{xy}$ denote the weight of the edge between nodes x and y,
with $w_{xy} = 0$ if no such edge exists. Let $w_x = \sum_y w_{xy}$. The Markov chain has transition
probabilities $p_{xy} = w_{xy}/w_x$. We assume the chain is connected.

We now claim that the stationary distribution $\pi$ of this walk has $\pi_x$ proportional to
$w_x$, i.e., $\pi_x = w_x / w_{\text{total}}$ for $w_{\text{total}} = \sum_{x'} w_{x'}$. Specifically, notice that

$$w_x p_{xy} = w_x \frac{w_{xy}}{w_x} = w_{xy} = w_{yx} = w_y \frac{w_{yx}}{w_y} = w_y p_{yx}.$$

Therefore $(w_x / w_{\text{total}}) p_{xy} = (w_y / w_{\text{total}}) p_{yx}$ and Lemma 4.3 implies that the values
$\pi_x = w_x / w_{\text{total}}$ are the stationary probabilities.
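
The claim can be checked on a small weighted graph. In the sketch below the three-vertex graph, its weights, and the helper `weight` are hypothetical; the code forms $p_{xy} = w_{xy}/w_x$ and $\pi_x = w_x/w_{\text{total}}$ with exact fractions and confirms that one step of the walk leaves $\pi$ unchanged.

```python
from fractions import Fraction as F

# Hypothetical symmetric edge weights on three vertices.
w = {("x", "y"): 2, ("y", "z"): 1, ("x", "z"): 3}
vertices = ["x", "y", "z"]


def weight(u, v):
    """Weight of the undirected edge {u, v}, or 0 if there is no such edge."""
    return w.get((u, v), w.get((v, u), 0))


w_vertex = {u: sum(weight(u, v) for v in vertices) for u in vertices}
w_total = sum(w_vertex.values())

P = {u: {v: F(weight(u, v), w_vertex[u]) for v in vertices} for u in vertices}
pi = {u: F(w_vertex[u], w_total) for u in vertices}

# One step from pi should return pi.
pi_next = {v: sum(pi[u] * P[u][v] for u in vertices) for v in vertices}
print(pi, pi_next == pi)
```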


An important question is how fast the walk starts to reflect the stationary probability
of the Markov process. If the convergence time was proportional to the number of states,
algorithms such as Metropolis-Hastings and Gibbs sampling would not be very useful since
the number of states can be exponentially large.

Figure 4.5: A network with a constriction. All edges have weight 1.

There are clear examples of connected chains that take a long time to converge. A
chain with a constriction, see Figure 4.5, takes a long time to converge since the walk is
unlikely to cross the narrow passage between the two halves, both of which are reasonably
big. We will show in Theorem 4.5 that the time to converge is quantitatively related to
the tightest constriction.


We define below a combinatorial measure of constriction for a Markov chain, called the
normalized conductance. We will relate normalized conductance to the time by which the
average probability distribution of the chain is guaranteed to be close to the stationary
probability distribution. We call this the ε-mixing time:

Definition 4.1 Fix ε > 0. The ε-mixing time of a Markov chain is the minimum integer t
such that for any starting distribution p, the 1-norm difference between the t-step running
average probability distribution¹⁷ and the stationary distribution is at most ε.


Definition 4.2 For a subset S of vertices, let $\pi(S)$ denote $\sum_{x \in S} \pi_x$. The normalized
conductance $\Phi(S)$ of S is

$$\Phi(S) = \frac{\sum_{(x,y) \in (S, \bar{S})} \pi_x p_{xy}}{\min\left(\pi(S), \pi(\bar{S})\right)}.$$

There is a simple interpretation of $\Phi(S)$. Suppose without loss of generality that
$\pi(S) \le \pi(\bar{S})$. Then, we may write $\Phi(S)$ as

$$\Phi(S) = \sum_{x \in S} \underbrace{\frac{\pi_x}{\pi(S)}}_{a} \underbrace{\sum_{y \in \bar{S}} p_{xy}}_{b}.$$

Here, a is the probability of being in x if we were in the stationary distribution restricted
to S and b is the probability of stepping from x to $\bar{S}$ in a single step. Thus, $\Phi(S)$ is the
probability of moving from S to $\bar{S}$ in one step if we are in the stationary distribution
restricted to S.

¹⁷ Recall that $a(t) = \frac{1}{t}\left(p(0) + p(1) + \cdots + p(t-1)\right)$ is called the running average distribution.


Definition 4.3 The normalized conductance of the Markov chain, denoted $\Phi$, is defined
by

$$\Phi = \min_{S \subset V,\, S \ne \{\}} \Phi(S).$$

As we just argued, normalized conductance being high is a necessary condition for
rapid mixing. The theorem below proves the converse that normalized conductance being
high is sufficient for mixing. Intuitively, if $\Phi$ is large, the walk rapidly leaves any subset
of states. But the proof of the theorem is quite difficult. After we prove it, we will see
examples where the mixing time is much smaller than the cover time. That is, the number
of steps before a random walk reaches a random state independent of its starting state is
much smaller than the average number of steps needed to reach every state. In fact, for
graphs whose conductance is bounded below by a constant, called expanders, the mixing
time is logarithmic in the number of states.
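
For very small chains the normalized conductance can be computed by brute force over all subsets. The sketch below is illustrative only: the function name `normalized_conductance` and the example chain (a path on four vertices with self-loops at both ends) are assumptions, and the enumeration takes time exponential in the number of states.

```python
from fractions import Fraction as F
from itertools import combinations


def normalized_conductance(P, pi):
    """Minimum of Phi(S) over all nonempty proper subsets S (brute force)."""
    n = len(pi)
    best = None
    for size in range(1, n):
        for S in combinations(range(n), size):
            inside = set(S)
            flow = sum(pi[x] * P[x][y]
                       for x in inside for y in range(n) if y not in inside)
            mass = sum(pi[x] for x in inside)
            phi = flow / min(mass, 1 - mass)
            best = phi if best is None else min(best, phi)
    return best


# Hypothetical example: path on 4 vertices with self-loops at both ends.
n = 4
P = [[F(0)] * n for _ in range(n)]
for x in range(n):
    for y in (x - 1, x + 1):
        if 0 <= y < n:
            P[x][y] = F(1, 2)
    P[x][x] = 1 - sum(P[x])          # self-loop mass at the two ends
pi = [F(1, n)] * n

print(normalized_conductance(P, pi))   # 1/4, attained by the first two vertices
```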


Theorem 4.5 The ε-mixing time of a random walk on an undirected graph is

$$O\left(\frac{\ln(1/\pi_{\min})}{\Phi^2 \varepsilon^3}\right)$$

where $\pi_{\min}$ is the minimum stationary probability of any state.


Proof: Let $t = \frac{c \ln(1/\pi_{\min})}{\Phi^2 \varepsilon^3}$ for a suitable constant c. Let

$$a = a(t) = \frac{1}{t}\left(p(0) + p(1) + \cdots + p(t-1)\right)$$

be the running average distribution. We need to show that $\|a - \pi\|_1 \le \varepsilon$. Let

$$v_i = \frac{a_i}{\pi_i},$$

and renumber states so that $v_1 \ge v_2 \ge v_3 \ge \cdots$. Thus, early indices i for which $v_i > 1$
are states that currently have too much probability, and late indices i for which $v_i < 1$
are states that currently have too little probability.

Intuitively, to show that $\|a - \pi\|_1 \le \varepsilon$ it is enough to show that the values $v_i$ are
relatively flat and do not drop too fast as we increase i. We begin by reducing our goal
to a formal statement of that form. Then, in the second part of the proof, we prove that
the $v_i$ do not fall fast using the concept of “probability flows”.

Figure 4.6: Bounding $\ell_1$ distance. (The figure plots the step functions f(x) and g(x) for the
groups $G_1 = \{1\}$; $G_2 = \{2, 3, 4\}$; $G_3 = \{5\}$.)


We call a state i for which $v_i > 1$ “heavy” since it has more probability according to
a than its stationary probability. Let $i_0$ be the maximum i such that $v_i > 1$; it is the last
heavy state. By Proposition (4.4):

$$\|a - \pi\|_1 = 2 \sum_{i=1}^{i_0} (v_i - 1)\pi_i = 2 \sum_{i \ge i_0 + 1} (1 - v_i)\pi_i. \qquad (4.3)$$

Let

$$\gamma_i = \pi_1 + \pi_2 + \cdots + \pi_i.$$

Define a function $f : [0, \gamma_{i_0}] \to \mathbb{R}$ by $f(x) = v_i - 1$ for $x \in [\gamma_{i-1}, \gamma_i)$. See Figure 4.6. Now,

$$\sum_{i=1}^{i_0} (v_i - 1)\pi_i = \int_0^{\gamma_{i_0}} f(x)\, dx. \qquad (4.4)$$

We make one more technical modification. We divide $\{1, 2, \ldots, i_0\}$ into groups
$G_1, G_2, G_3, \ldots, G_r$ of contiguous subsets. We specify the groups later. Let
$u_s = \max_{i \in G_s} v_i$ be the maximum value of $v_i$ within $G_s$. Define a new function g(x) by
$g(x) = u_s - 1$ for $x \in \cup_{i \in G_s} [\gamma_{i-1}, \gamma_i)$; see Figure 4.6. Since $g(x) \ge f(x)$,

$$\int_0^{\gamma_{i_0}} f(x)\, dx \le \int_0^{\gamma_{i_0}} g(x)\, dx. \qquad (4.5)$$

We now assert (with $u_{r+1} = 1$):

$$\int_0^{\gamma_{i_0}} g(x)\, dx = \sum_{t=1}^{r} \pi(G_1 \cup G_2 \cup \ldots \cup G_t)(u_t - u_{t+1}). \qquad (4.6)$$


This is just the statement that the area under g(x) in the figure is exactly covered by the
rectangles whose bottom sides are the dotted lines. We leave the formal proof of this to
the reader. We now focus on proving that

$$\sum_{s=1}^{r} \pi(G_1 \cup G_2 \cup \ldots \cup G_s)(u_s - u_{s+1}) \le \varepsilon/2, \qquad (4.7)$$

for a sub-division into groups that we specify below; this suffices by (4.3), (4.4), (4.5) and
(4.6). While we start the proof of (4.7) with a technical observation (4.8), its proof will
involve two nice ideas: the notion of probability flow and reckoning probability flow in two
different ways. First, the technical observation: if $2 \sum_{i \ge i_0 + 1} (1 - v_i)\pi_i \le \varepsilon$ then we would
be done by (4.3). So assume now that $\sum_{i \ge i_0 + 1} (1 - v_i)\pi_i > \varepsilon/2$ from which it follows that
$\sum_{i \ge i_0 + 1} \pi_i \ge \varepsilon/2$ and so, for any subset A of heavy nodes,

$$\text{Min}(\pi(A), \pi(\bar{A})) \ge \frac{\varepsilon}{2} \pi(A). \qquad (4.8)$$

We now define the subsets. $G_1$ will be just {1}. In general, suppose $G_1, G_2, \ldots, G_{s-1}$ have
already been defined. We start $G_s$ at $i_s = 1 +$ (end of $G_{s-1}$). Let $i_s = k$. We will define
l, the last element of $G_s$, to be the largest integer greater than or equal to k and at most
$i_0$ so that

$$\sum_{j=k+1}^{l} \pi_j \le \frac{\varepsilon \Phi \gamma_k}{4}.$$

In Lemma 4.6, which follows this theorem, we prove that for groups $G_1, G_2, \ldots, G_r$,
$u_1, u_2, \ldots, u_r, u_{r+1}$ as above

$$\pi(G_1 \cup G_2 \cup \ldots \cup G_s)(u_s - u_{s+1}) \le \frac{8}{t \Phi \varepsilon}.$$

Now to prove (4.7), we only need an upper bound on r, the number of groups. If
$G_s = \{k, k+1, \ldots, l\}$, with $l < i_0$, then by definition of l, we have $\gamma_{l+1} \ge \left(1 + \frac{\varepsilon \Phi}{2}\right)\gamma_k$. So,
$r \le \ln_{1 + (\varepsilon \Phi/2)}(1/\pi_1) + 2 \le \ln(1/\pi_1)/(\varepsilon \Phi/2) + 2$. This completes the proof of (4.7) and the
theorem. ∎


We complete the proof of Theorem 4.5 with the proof of Lemma 4.6. The notation in the 
lemma is that from the theorem. 


Lemma 4.6 Suppose groups $G_1, G_2, \ldots, G_r$, $u_1, u_2, \ldots, u_r, u_{r+1}$ are as above. Then,

$$\pi(G_1 \cup G_2 \cup \ldots \cup G_s)(u_s - u_{s+1}) \le \frac{8}{t \Phi \varepsilon}.$$

Proof: This is the main lemma. The proof of the lemma uses a crucial idea of probability
flows. We will use two ways of calculating the probability flow from heavy states to light
states when we execute one step of the Markov chain starting at probabilities a. The
probability vector after that step is aP. Now, $a - aP$ is the net loss of probability for
each state due to the step.

Consider a particular group $G_s = \{k, k+1, \ldots, l\}$, say. First consider the case when
$k < i_0$. Let $A = \{1, 2, \ldots, k\}$. The net loss of probability from the set A in
one step is $\sum_{i=1}^{k} (a_i - (aP)_i)$ which is at most $\frac{2}{t}$ by the proof of Theorem 4.2.

Another way to reckon the net loss of probability from A is to take the difference of
the probability flow from A to $\bar{A}$ and the flow from $\bar{A}$ to A. For any i < j,

$$\text{net-flow}(i, j) = \text{flow}(i, j) - \text{flow}(j, i) = \pi_i p_{ij} v_i - \pi_j p_{ji} v_j = \pi_j p_{ji}(v_i - v_j) \ge 0.$$

Thus, for any two states i and j, with i heavier than j, i.e., i < j, there is a non-negative
net flow from i to j. (This is intuitively reasonable since it says that probability is flowing
from heavy to light states.) Since $l \ge k$, the flow from A to $\{k+1, k+2, \ldots, l\}$ minus the
flow from $\{k+1, k+2, \ldots, l\}$ to A is non-negative. Since for $i \le k$ and $j > l$, we have
$v_i \ge v_k$ and $v_j \le v_{l+1}$, the net loss from A is at least

$$\sum_{\substack{i \le k \\ j > l}} \pi_j p_{ji}(v_i - v_j) \ge (v_k - v_{l+1}) \sum_{\substack{i \le k \\ j > l}} \pi_j p_{ji}.$$

Thus,

$$(v_k - v_{l+1}) \sum_{\substack{i \le k \\ j > l}} \pi_j p_{ji} \le \frac{2}{t}. \qquad (4.9)$$

Since

$$\sum_{i=1}^{k} \sum_{j=k+1}^{l} \pi_j p_{ji} \le \sum_{j=k+1}^{l} \pi_j \le \varepsilon \Phi \pi(A)/4$$

and by the definition of $\Phi$, using (4.8)

$$\sum_{i \le k < j} \pi_j p_{ji} \ge \Phi\, \text{Min}(\pi(A), \pi(\bar{A})) \ge \varepsilon \Phi \gamma_k / 2,$$

we have $\sum_{i \le k,\, j > l} \pi_j p_{ji} = \sum_{i \le k < j} \pi_j p_{ji} - \sum_{i \le k < j \le l} \pi_j p_{ji} \ge \varepsilon \Phi \gamma_k / 4$. Substituting this into the
inequality (4.9) gives

$$v_k - v_{l+1} \le \frac{8}{t \varepsilon \Phi \gamma_k} \qquad (4.10)$$

proving the lemma provided $k < i_0$. If $k = i_0$, the proof is similar but simpler. ∎


Theorem 4.5 gives an upper bound on the mixing time in terms of the conductance. 
Conversely, $\Omega(1/\Phi)$ is a known lower bound. We do not prove this here.




4.4.1 Using Normalized Conductance to Prove Convergence 


We now apply Theorem 4.5 to some examples to illustrate how the normalized con- 
ductance bounds the rate of convergence. In each case we compute the mixing time for 
the uniform probability function on the vertices. Our first examples will be simple graphs. 
The graphs do not have rapid convergence, but their simplicity helps illustrate how to bound
the normalized conductance and hence the rate of convergence. 


A 1-dimensional lattice 


Consider a random walk on an undirected graph consisting of an n-vertex path with
self-loops at both ends. With the self loops, we have $p_{xy} = 1/2$ on all edges (x, y),
and so the stationary distribution is a uniform $\frac{1}{n}$ over all vertices by Lemma 4.3. The set
with minimum normalized conductance is the set S with probability $\pi(S) \le \frac{1}{2}$ having the
smallest ratio of probability mass exiting it, $\sum_{(x,y) \in (S, \bar{S})} \pi_x p_{xy}$, to probability mass inside
it, $\pi(S)$. This set consists of the first n/2 vertices, for which the numerator is $\frac{1}{2n}$ and
denominator is $\frac{1}{2}$. Thus,

$$\Phi(S) = \frac{1}{n}.$$

By Theorem 4.5, for ε a constant such as 1/100, after $O\left((n^2 \log n)/\varepsilon^3\right)$ steps, $\|a(t) - \pi\|_1 \le$
1/100. This graph does not have rapid convergence. The hitting time and the cover time
are $O(n^2)$. In many interesting cases, the mixing time may be much smaller than the
cover time. We will see such an example later.


A 2-dimensional lattice 


Consider the n × n lattice in the plane where from each point there is a transition to
each of the coordinate neighbors with probability 1/4. At the boundary there are self-loops
with probability 1 - (number of neighbors)/4. It is easy to see that the chain is connected.
Since $p_{ij} = p_{ji}$, the function $f_i = 1/n^2$ satisfies $f_i p_{ij} = f_j p_{ji}$ and by Lemma 4.3, f is the
stationary distribution. Consider any subset S consisting of at most half the states. If
$|S| \ge \frac{n^2}{4}$, then the subset with the fewest edges leaving it consists of some number of
columns plus perhaps one additional partial column. The number of edges leaving S is at
least n. Thus

$$\sum_{i \in S} \sum_{j \in \bar{S}} \pi_i p_{ij} \ge \Omega\left(n \cdot \frac{1}{n^2}\right) = \Omega\left(\frac{1}{n}\right).$$

Since $|S| \ge \frac{n^2}{4}$, in this case

$$\Phi(S) \ge \Omega\left(\frac{1/n}{\min\left(\frac{|S|}{n^2}, \frac{|\bar{S}|}{n^2}\right)}\right) = \Omega\left(\frac{1}{n}\right).$$

If $|S| < \frac{n^2}{4}$, the subset S of a given size that has the minimum number of edges leaving
consists of a square located at the lower left hand corner of the grid (Exercise 4.24). If
|S| is not a perfect square then the right most column of S is short. Thus at least $2\sqrt{|S|}$
points in S are adjacent to points in $\bar{S}$. Each of these points contributes $\pi_i p_{ij} = \Omega\left(\frac{1}{n^2}\right)$ to
the flow(S, $\bar{S}$). Thus,

$$\sum_{i \in S} \sum_{j \in \bar{S}} \pi_i p_{ij} \ge \frac{c\sqrt{|S|}}{n^2}$$

and

$$\Phi(S) = \frac{\sum_{i \in S} \sum_{j \in \bar{S}} \pi_i p_{ij}}{\min\left(\pi(S), \pi(\bar{S})\right)} \ge \frac{c\sqrt{|S|}/n^2}{|S|/n^2} = \frac{c}{\sqrt{|S|}} = \Omega\left(\frac{1}{n}\right).$$

Thus, in either case, after $O(n^2 \ln n/\varepsilon^3)$ steps, $\|a(t) - \pi\|_1 \le \varepsilon$.
A lattice in d-dimensions 


Next consider the n × n × ··· × n lattice in d-dimensions with a self-loop at each
boundary point with probability 1 - (number of neighbors)/2d. The self loops make all
$\pi_i$ equal to $n^{-d}$. View the lattice as an undirected graph and consider the random walk
on this undirected graph. Since there are $n^d$ states, the cover time is at least $n^d$ and
thus exponentially dependent on d. It is possible to show (Exercise 4.25) that $\Phi$ is $\Omega\left(\frac{1}{dn}\right)$.
Since all $\pi_i$ are equal to $n^{-d}$, the mixing time is $O(d^3 n^2 \ln n/\varepsilon^3)$, which is polynomially
bounded in n and d.


The d-dimensional lattice is related to the Metropolis-Hastings algorithm and Gibbs 
sampling although in those constructions there is a non-uniform probability distribution at 
the vertices. However, the d-dimension lattice case suggests why the Metropolis-Hastings 
and Gibbs sampling constructions might converge fast. 


A clique 


Consider an n vertex clique with a self loop at each vertex. For each edge, $p_{xy} = \frac{1}{n}$
and thus for each vertex, $\pi_x = \frac{1}{n}$. Let S be a subset of the vertices. Then

$$\sum_{x \in S} \pi_x = \frac{|S|}{n},$$

$$\sum_{(x,y) \in (S, \bar{S})} \pi_x p_{xy} = \pi_x p_{xy} |S| |\bar{S}| = \frac{1}{n^2} |S| |\bar{S}|,$$

and

$$\Phi(S) = \frac{\sum_{(x,y) \in (S, \bar{S})} \pi_x p_{xy}}{\min\left(\pi(S), \pi(\bar{S})\right)} = \frac{\frac{1}{n^2} |S| |\bar{S}|}{\frac{1}{n} \min(|S|, |\bar{S}|)} = \frac{1}{n} \max(|S|, |\bar{S}|) \ge \frac{1}{2}.$$

This gives a bound on the ε-mixing time of

$$O\left(\frac{\ln \frac{1}{\pi_{\min}}}{\Phi^2 \varepsilon^3}\right) = O\left(\frac{\ln n}{\varepsilon^3}\right).$$

However, a walker on the clique starting from any probability distribution will in one step
be exactly at the stationary probability distribution.
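
The one-step convergence on the clique is easy to confirm numerically. The sketch below uses a hypothetical five-vertex clique and a hypothetical starting distribution concentrated on one vertex; it shows that the general bound of Theorem 4.5 is far from tight for this chain.

```python
from fractions import Fraction as F

n = 5
P = [[F(1, n)] * n for _ in range(n)]        # clique with self-loops: p_xy = 1/n

start = [F(1), F(0), F(0), F(0), F(0)]       # hypothetical start: all mass on vertex 0
after_one = [sum(start[x] * P[x][y] for x in range(n)) for y in range(n)]
print(after_one == [F(1, n)] * n)
```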


A connected undirected graph 


Next consider a random walk on a connected n vertex undirected graph where at each
vertex all edges are equally likely. The stationary probability of a vertex equals the degree
of the vertex divided by the sum of degrees. That is, if the degree of vertex x is $d_x$ and
the number of edges in the graph is m, then $\pi_x = \frac{d_x}{2m}$. Notice that for any edge (x, y) we
have

$$\pi_x p_{xy} = \left(\frac{d_x}{2m}\right)\left(\frac{1}{d_x}\right) = \frac{1}{2m}.$$

Therefore, for any S, the total conductance of edges out of S is at least $\frac{1}{2m}$, and so
$\Phi$ is at least $\frac{1}{m}$. Since $\pi_{\min} \ge \frac{1}{2m} \ge \frac{1}{n^2}$, $\ln\frac{1}{\pi_{\min}} = O(\ln n)$. Thus, the mixing time is
$O(m^2 \ln n/\epsilon^3) = O(n^4 \ln n/\epsilon^3)$.
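
The claim that $\pi_x = d_x/2m$ is stationary can be checked numerically. The following is a minimal sketch, assuming the graph is given as adjacency lists; the helper names `stationary_by_degree` and `one_step` and the four-vertex example graph are illustrative choices, not taken from the text. Exact rational arithmetic is used so that the comparison is exact.

```python
from fractions import Fraction

def stationary_by_degree(adj):
    """Return pi_x = d_x / (2m) for an undirected graph given as adjacency lists."""
    two_m = sum(len(nbrs) for nbrs in adj.values())  # sum of degrees equals 2m
    return {x: Fraction(len(nbrs), two_m) for x, nbrs in adj.items()}

def one_step(adj, dist):
    """Apply one step of the walk in which every incident edge is equally likely."""
    new = {x: Fraction(0) for x in adj}
    for x, nbrs in adj.items():
        for y in nbrs:
            new[y] += dist[x] / len(nbrs)  # mass dist[x] splits evenly over d_x edges
    return new

# Hypothetical example: a triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
pi = stationary_by_degree(adj)
unchanged = one_step(adj, pi) == pi  # True when pi is stationary
```

In this example every edge carries the same probability flow $\pi_x p_{xy} = 1/(2m) = 1/8$ in each direction, which is the fact used above to bound the conductance.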








The Gaussian distribution on the interval [-1,1] 


Consider the interval $[-1, 1]$. Let $\delta$ be a “grid size” specified later and let G be the
graph consisting of a path on the $\frac{2}{\delta} + 1$ vertices $\{-1, -1+\delta, -1+2\delta, \ldots, 1-\delta, 1\}$ having
self loops at the two ends. Let $\pi_x = ce^{-\alpha x^2}$ for $x \in \{-1, -1+\delta, -1+2\delta, \ldots, 1-\delta, 1\}$
where $\alpha > 1$ and c has been adjusted so that $\sum_x \pi_x = 1$.


We now describe a simple Markov chain with the $\pi_x$ as its stationary probability and
argue its fast convergence. With the Metropolis-Hastings construction, the transition
probabilities are

$$p_{x,x+\delta} = \frac{1}{2}\min\left(1, \frac{e^{-\alpha(x+\delta)^2}}{e^{-\alpha x^2}}\right) \quad\text{and}\quad p_{x,x-\delta} = \frac{1}{2}\min\left(1, \frac{e^{-\alpha(x-\delta)^2}}{e^{-\alpha x^2}}\right).$$
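
The Metropolis-Hastings chain on the grid can be built explicitly and checked for detailed balance, $\pi_x p_{x,x+\delta} = \pi_{x+\delta} p_{x+\delta,x}$. The sketch below is illustrative only: the function name `gaussian_grid_chain`, the parameter `steps` (so that $\delta = 2/\text{steps}$), and the choice to place all leftover probability on a self loop at every vertex are assumptions made here to obtain rows that sum to one.

```python
import math

def gaussian_grid_chain(alpha, steps):
    """Metropolis-Hastings chain on the grid {-1, -1+delta, ..., 1}, delta = 2/steps."""
    delta = 2.0 / steps
    xs = [-1.0 + k * delta for k in range(steps + 1)]
    weights = [math.exp(-alpha * x * x) for x in xs]
    c = 1.0 / sum(weights)                 # normalizing constant so that sum(pi) = 1
    pi = [c * w for w in weights]
    n = len(xs)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i + 1):
            if 0 <= j < n:
                # propose each neighbor with probability 1/2, accept with min(1, pi_j/pi_i)
                P[i][j] = 0.5 * min(1.0, weights[j] / weights[i])
        P[i][i] = 1.0 - sum(P[i])          # rejected or unavailable moves stay put
    return xs, pi, P

xs, pi, P = gaussian_grid_chain(alpha=2.0, steps=20)
max_imbalance = max(abs(pi[i] * P[i][i + 1] - pi[i + 1] * P[i + 1][i])
                    for i in range(len(xs) - 1))
```

Here `max_imbalance` should be zero up to floating-point rounding, since both sides of the detailed balance equation equal $\frac{1}{2}\min(\pi_x, \pi_{x+\delta})$.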


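Let S be any subset of states with $\pi(S) \le \frac{1}{2}$. First consider the case when S is an interval
$[k\delta, 1]$ for $k \ge 2$. It is easy to see that

$$\pi(S) \le \int_{x=(k-1)\delta}^{\infty} ce^{-\alpha x^2}\,dx \le \int_{(k-1)\delta}^{\infty} \frac{x}{(k-1)\delta}\, ce^{-\alpha x^2}\,dx = O\left(\frac{ce^{-\alpha((k-1)\delta)^2}}{\alpha(k-1)\delta}\right).$$

Now there is only one edge from S to $\bar{S}$ and total conductance of edges out of S is

$$\sum_{i \in S} \sum_{j \notin S} \pi_i p_{ij} = \pi_{k\delta}\, p_{k\delta,(k-1)\delta} = \min\left(ce^{-\alpha k^2\delta^2}, ce^{-\alpha(k-1)^2\delta^2}\right) = ce^{-\alpha k^2\delta^2}.$$

Using $2 \le k \le 1/\delta$, $\alpha \ge 1$, and $\pi(\bar{S}) \le 1$,

$$\Phi(S) = \frac{\text{flow}(S, \bar{S})}{\min(\pi(S), \pi(\bar{S}))} \ge ce^{-\alpha k^2\delta^2}\, \frac{\alpha(k-1)\delta}{ce^{-\alpha((k-1)\delta)^2}} \ge \Omega\left(\alpha(k-1)\delta\, e^{-\alpha\delta^2(2k-1)}\right) \ge \Omega\left(\alpha\delta\, e^{-O(\alpha\delta)}\right).$$

For the grid size less than the variance of the Gaussian distribution, $\delta < \frac{1}{\alpha}$, we have $\alpha\delta < 1$,
so $e^{-O(\alpha\delta)} = \Omega(1)$, thus, $\Phi(S) \ge \Omega(\alpha\delta)$. Now, $\pi_{\min} \ge ce^{-\alpha} \ge e^{-1/\delta}$, so $\ln(1/\pi_{\min}) \le 1/\delta$.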


If S is not an interval of the form [k, 1] or [−1, k], then the situation is only better
since there is more than one “boundary” point which contributes to flow$(S, \bar{S})$. We do
not present this argument here. By Theorem 4.5, in $O(1/(\alpha^2\delta^3\epsilon^3))$ steps, a walk gets within
$\epsilon$ of the steady state distribution.


In the uniform probability case the $\epsilon$-mixing time is bounded by $n^2 \log n$. For comparison,
in the Gaussian case set $\delta = 1/n$ and $\alpha = 1/3$. This gives an $\epsilon$-mixing time bound
of $n^3$. In the Gaussian case with the entire initial probability on the first vertex, the chain
begins to converge faster to the stationary probability than the uniform distribution case 
since the chain favors higher degree vertices. However, ultimately the distribution must 
reach the lower probability vertices on the other side of the Gaussian’s maximum and 
here the chain is slower since it favors not leaving the higher probability vertices. 


In these examples, we have chosen simple probability distributions. The methods ex- 
tend to more complex situations. 


4.5 Electrical Networks and Random Walks 


In the next few sections, we study the relationship between electrical networks and 
random walks on undirected graphs. The graphs have non-negative weights on each edge. 
A step is executed by picking a random edge from the current vertex with probability 
proportional to the edge’s weight and traversing the edge. 


An electrical network is a connected, undirected graph in which each edge (x, y) has
a resistance $r_{xy} > 0$. In what follows, it is easier to deal with conductance defined as the
reciprocal of resistance, $c_{xy} = \frac{1}{r_{xy}}$, rather than resistance. Associated with an electrical
network is a random walk on the underlying graph defined by assigning a probability
$p_{xy} = c_{xy}/c_x$ to the edge (x, y) incident to the vertex x, where the normalizing constant $c_x$
equals $\sum_y c_{xy}$. Note that although $c_{xy}$ equals $c_{yx}$, the probabilities $p_{xy}$ and $p_{yx}$ may not be
equal due to the normalization required to make the probabilities at each vertex sum to
one. We shall soon see that there is a relationship between current flowing in an electrical
network and a random walk on the underlying graph.


Since we assume that the undirected graph is connected, by Theorem 4.2 there is
a unique stationary probability distribution. The stationary probability distribution is $\pi$
where $\pi_x = \frac{c_x}{c_0}$ with $c_0 = \sum_x c_x$. To see this, for all x and y

$$\pi_x p_{xy} = \frac{c_x}{c_0}\frac{c_{xy}}{c_x} = \frac{c_y}{c_0}\frac{c_{yx}}{c_y} = \pi_y p_{yx}$$

and hence by Lemma 4.3, $\pi$ is the unique stationary probability.


Harmonic functions


Harmonic functions are useful in developing the relationship between electrical networks
and random walks on undirected graphs. Given an undirected graph, designate
a non-empty set of vertices as boundary vertices and the remaining vertices as interior
vertices. A harmonic function g on the vertices is a function whose value at the boundary
vertices is fixed to some boundary condition, and whose value at any interior vertex x is
a weighted average of its values at all the adjacent vertices y, with weights $p_{xy}$ satisfying
$\sum_y p_{xy} = 1$ for each x. Thus, if at every interior vertex x, for some set of weights $p_{xy}$
satisfying $\sum_y p_{xy} = 1$, $g_x = \sum_y g_y p_{xy}$, then g is a harmonic function.


Example: Convert an electrical network with conductances $c_{xy}$ to a weighted, undirected
graph with probabilities $p_{xy}$. Let f be a function satisfying fP = f where P is the matrix
of probabilities. It follows that the function $g_x = \frac{f_x}{c_x}$ is harmonic.

$$g_x = \frac{f_x}{c_x} = \frac{1}{c_x}\sum_y f_y p_{yx} = \frac{1}{c_x}\sum_y f_y \frac{c_{yx}}{c_y} = \frac{1}{c_x}\sum_y f_y \frac{c_{xy}}{c_y} = \sum_y \frac{f_y}{c_y}\frac{c_{xy}}{c_x} = \sum_y g_y p_{xy}$$





A harmonic function on a connected graph takes on its maximum and minimum on
the boundary. This is easy to see for the following reason. Suppose the maximum does
not occur on the boundary. Let S be the set of vertices at which the maximum value is
attained. Since S contains no boundary vertices, $\bar{S}$ is non-empty. Connectedness implies
that there is at least one edge (x, y) with $x \in S$ and $y \in \bar{S}$. The value of the function at
x is the weighted average of the value at its neighbors, all of which are less than or equal
to the value at x and the value at y is strictly less, a contradiction. The proof for the
minimum value is identical.


There is at most one harmonic function satisfying a given set of equations and boundary
conditions. For suppose there were two solutions, f(x) and g(x). The difference of two
solutions is itself harmonic. Since h(x) = f(x) − g(x) is harmonic and has value zero on the
boundary, by the min and max principles it has value zero everywhere. Thus f(x) = g(x).


Figure 4.7: Graph illustrating an harmonic function. Panel labels: graph with boundary
vertices dark and boundary conditions specified; values of harmonic function satisfying
boundary conditions where the edge weights at each vertex are equal.


The analogy between electrical networks and random walks 


There are important connections between electrical networks and random walks on
undirected graphs. Choose two vertices a and b. Attach a voltage source between a and b
so that the voltage $v_a$ equals one volt and the voltage $v_b$ equals zero. Fixing the voltages
at $v_a$ and $v_b$ induces voltages at all other vertices, along with a current flow through the
edges of the network. What we will show below is the following. Having fixed the voltages
at the vertices a and b, the voltage at an arbitrary vertex x equals the probability that a
random walk that starts at x will reach a before it reaches b. We will also show that there is
a related probabilistic interpretation of current.


Probabilistic interpretation of voltages 


Before relating voltages and probabilities, we first show that the voltages form a harmonic
function. Let x and y be adjacent vertices and let $i_{xy}$ be the current flowing through
the edge from x to y. By Ohm’s law,

$$i_{xy} = \frac{v_x - v_y}{r_{xy}} = (v_x - v_y)c_{xy}.$$

By Kirchhoff’s law the currents flowing out of each vertex other than a and b sum to zero.

$$\sum_y i_{xy} = 0$$

Replacing currents in the above sum by the voltage difference times the conductance
yields

$$\sum_y (v_x - v_y)c_{xy} = 0$$

or

$$v_x \sum_y c_{xy} = \sum_y v_y c_{xy}.$$

Observing that $\sum_y c_{xy} = c_x$ and that $p_{xy} = \frac{c_{xy}}{c_x}$ yields $v_x c_x = \sum_y v_y p_{xy} c_x$. Hence,
$v_x = \sum_y v_y p_{xy}$. Thus, the voltage at each such vertex x is a weighted average of the voltages
at the adjacent vertices. Hence the voltages form a harmonic function with {a, b} as
the boundary.


Let $p_x$ be the probability that a random walk starting at vertex x reaches a before b.
Clearly $p_a = 1$ and $p_b = 0$. Since $v_a = 1$ and $v_b = 0$, it follows that $p_a = v_a$ and $p_b = v_b$.
Furthermore, the probability of the walk reaching a from x before reaching b is the sum
over all y adjacent to x of the probability of the walk going from x to y in the first step
and then reaching a from y before reaching b. That is

$$p_x = \sum_y p_{xy} p_y.$$

Hence, $p_x$ is the same harmonic function as the voltage function $v_x$, and v and p satisfy the
same boundary conditions at a and b. Thus, they are identical functions. The probability
of a walk starting at x reaching a before reaching b is the voltage $v_x$.
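
Because the voltage is the unique harmonic function with $v_a = 1$ and $v_b = 0$, it can be approximated by repeatedly replacing each interior value with the weighted average of its neighbors. The sketch below is a minimal illustration under assumptions made here: the function name `voltages_by_relaxation`, the nested-dictionary representation `cond[x][y]` of the conductances $c_{xy}$, the fixed number of sweeps, and the small weighted path used as an example are all hypothetical.

```python
def voltages_by_relaxation(cond, a, b, sweeps=500):
    """Approximate the harmonic function with v[a] = 1 and v[b] = 0.

    cond[x][y] is the conductance c_xy of edge (x, y); it must be symmetric.
    """
    v = {x: 0.0 for x in cond}
    v[a] = 1.0
    for _ in range(sweeps):
        for x in cond:
            if x in (a, b):
                continue  # boundary values stay fixed
            c_x = sum(cond[x].values())
            # v_x = sum_y p_xy * v_y with p_xy = c_xy / c_x
            v[x] = sum(c * v[y] for y, c in cond[x].items()) / c_x
    return v

# Hypothetical path 0 - 1 - 2 - 3 with conductances 1, 2, 1 (resistances 1, 1/2, 1).
cond = {0: {1: 1.0}, 1: {0: 1.0, 2: 2.0}, 2: {1: 2.0, 3: 1.0}, 3: {2: 1.0}}
v = voltages_by_relaxation(cond, a=0, b=3)
```

For this series network the voltages can also be read off by voltage division: the total resistance is 2.5, so the values at vertices 1 and 2 are 0.6 and 0.4, which are then the probabilities of reaching vertex 0 before vertex 3.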


Probabilistic interpretation of current 


In a moment, we will set the current into the network at a to have a value which we will
equate with one random walk. We will then show that the current $i_{xy}$ is the net frequency
with which a random walk from a to b goes through the edge xy before reaching b. Let
$u_x$ be the expected number of visits to vertex x on a walk from a to b before reaching b.
Clearly $u_b = 0$. Consider a node x not equal to a or b. Every time the walk visits x, it
must have come from some neighbor y. Thus, the expected number of visits to x before
reaching b is the sum over all neighbors y of the expected number of visits $u_y$ to y before
reaching b times the probability $p_{yx}$ of going from y to x. That is,

$$u_x = \sum_y u_y p_{yx}.$$

Since $c_x p_{xy} = c_y p_{yx}$,

$$u_x = \sum_y u_y \frac{c_x p_{xy}}{c_y}$$

and hence $\frac{u_x}{c_x} = \sum_y \frac{u_y}{c_y} p_{xy}$. It follows that $\frac{u_x}{c_x}$ is harmonic with a and b as the boundary
where the boundary conditions are $u_b = 0$ and $u_a$ equals some fixed value. Now, $\frac{u_b}{c_b} = 0$.
Setting the current into a to one fixes the value of $v_a$. Adjust the current into a so that
$v_a$ equals $\frac{u_a}{c_a}$. Now $\frac{u_x}{c_x}$ and $v_x$ satisfy the same boundary conditions and thus are the same
harmonic function. Let the current into a correspond to one walk. Note that if the walk
starts at a and ends at b, the expected value of the difference between the number of times
the walk leaves a and enters a must be one. This implies that the amount of current into
a corresponds to one walk.


Next we need to show that the current $i_{xy}$ is the net frequency with which a random
walk traverses edge xy.

$$i_{xy} = (v_x - v_y)c_{xy} = \left(\frac{u_x}{c_x} - \frac{u_y}{c_y}\right)c_{xy} = u_x\frac{c_{xy}}{c_x} - u_y\frac{c_{xy}}{c_y} = u_x p_{xy} - u_y p_{yx}$$

The quantity $u_x p_{xy}$ is the expected number of times the edge xy is traversed from x to y
and the quantity $u_y p_{yx}$ is the expected number of times the edge xy is traversed from y to
x. Thus, the current $i_{xy}$ is the expected net number of traversals of the edge xy from x to y.


Effective resistance and escape probability 


Set $v_a = 1$ and $v_b = 0$. Let $i_a$ be the current flowing into the network at vertex a and
out at vertex b. Define the effective resistance $r_{\text{eff}}$ between a and b to be $r_{\text{eff}} = \frac{v_a}{i_a}$ and
the effective conductance $c_{\text{eff}}$ to be $c_{\text{eff}} = \frac{1}{r_{\text{eff}}}$. Define the escape probability, $p_{\text{escape}}$, to
be the probability that a random walk starting at a reaches b before returning to a. We
now show that the escape probability is $\frac{c_{\text{eff}}}{c_a}$. For convenience, assume that a and b are
not adjacent. A slight modification of the argument suffices for the case when a and b are
adjacent.

$$i_a = \sum_y (v_a - v_y)c_{ay}$$

Since $v_a = 1$,

$$i_a = \sum_y c_{ay} - c_a \sum_y \frac{c_{ay}}{c_a} v_y = c_a\left[1 - \sum_y p_{ay} v_y\right].$$

For each y adjacent to the vertex a, $p_{ay}$ is the probability of the walk going from vertex
a to vertex y. Earlier we showed that $v_y$ is the probability of a walk starting at y going
to a before reaching b. Thus, $\sum_y p_{ay} v_y$ is the probability of a walk starting at a returning
to a before reaching b and $1 - \sum_y p_{ay} v_y$ is the probability of a walk starting at a reaching
b before returning to a. Thus, $i_a = c_a p_{\text{escape}}$. Since $v_a = 1$ and $c_{\text{eff}} = \frac{i_a}{v_a}$, it follows that
$c_{\text{eff}} = i_a$. Thus, $c_{\text{eff}} = c_a p_{\text{escape}}$ and hence $p_{\text{escape}} = \frac{c_{\text{eff}}}{c_a}$.
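
As a small worked illustration of $p_{\text{escape}} = c_{\text{eff}}/c_a$, consider a hypothetical network (not from the text) with four vertices w, a, x, b and unit resistors on the edges (w, a), (a, x), and (x, b). The edge (a, w) is a dead end, so no current flows through it when a voltage is applied between a and b.

```latex
\begin{align*}
c_a &= c_{aw} + c_{ax} = 2, \\
r_{\text{eff}} &= r_{ax} + r_{xb} = 2, \qquad c_{\text{eff}} = \frac{1}{r_{\text{eff}}} = \frac{1}{2}, \\
p_{\text{escape}} &= \frac{c_{\text{eff}}}{c_a} = \frac{1}{4}.
\end{align*}
```

This agrees with a direct count: from a the walk steps to w with probability 1/2 and is then forced back to a, or steps to x with probability 1/2 and from there reaches b before a with probability 1/2, giving 1/4 in total.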





For a finite connected graph, the escape probability will always be non-zero. Consider
an infinite graph such as a lattice and a random walk starting at some vertex a. Form a
series of finite graphs by merging all vertices at distance d or greater from a into a single
vertex b for larger and larger values of d. The limit of $p_{\text{escape}}$ as d goes to infinity is the
probability that the random walk will never return to a. If $p_{\text{escape}} \to 0$, then eventually
any random walk will return to a. If $p_{\text{escape}} \to q$ where q > 0, then a fraction of the walks
never return. Thus, the escape probability terminology.


4.6 Random Walks on Undirected Graphs with Unit Edge Weights 


We now focus our discussion on random walks on undirected graphs with uniform 
edge weights. At each vertex, the random walk is equally likely to take any edge. This 
corresponds to an electrical network in which all edge resistances are one. Assume the 
graph is connected. We consider questions such as what is the expected time for a random 
walk starting at a vertex x to reach a target vertex y, what is the expected time until the 
random walk returns to the vertex it started at, and what is the expected time to reach 
every vertex? 


Hitting time 


The hitting time $h_{xy}$, sometimes called discovery time, is the expected time of a random
walk starting at vertex x to reach vertex y. Sometimes a more general definition is
given where the hitting time is the expected time to reach a vertex y from a given starting
probability distribution.


One interesting fact is that adding edges to a graph may either increase or decrease
$h_{xy}$ depending on the particular situation. Adding an edge can shorten the distance from
x to y thereby decreasing $h_{xy}$, or the edge could increase the probability of a random walk
going to some far off portion of the graph thereby increasing $h_{xy}$. Another interesting
fact is that hitting time is not symmetric. The expected time to reach a vertex y from a
vertex x in an undirected graph may be radically different from the time to reach x from y.


We start with two technical lemmas. The first lemma states that the expected time
to traverse a path of n vertices is $\Theta(n^2)$.


Lemma 4.7 The expected time for a random walk starting at one end of a path of n
vertices to reach the other end is $\Theta(n^2)$.


Proof: Consider walking from vertex 1 to vertex n in a graph consisting of a single path
of n vertices. Let $h_{ij}$, i < j, be the hitting time of reaching j starting from i. Now $h_{12} = 1$
and

$$h_{i,i+1} = \frac{1}{2} + \frac{1}{2}(1 + h_{i-1,i+1}) = 1 + \frac{1}{2}(h_{i-1,i} + h_{i,i+1}), \qquad 2 \le i \le n-1.$$

Solving for $h_{i,i+1}$ yields the recurrence

$$h_{i,i+1} = 2 + h_{i-1,i}.$$

Solving the recurrence yields

$$h_{i,i+1} = 2i - 1.$$

To get from 1 to n, you need to first reach 2, then from 2 (eventually) reach 3, then from
3 (eventually) reach 4, and so on. Thus by linearity of expectation,

$$h_{1n} = \sum_{i=1}^{n-1} h_{i,i+1} = \sum_{i=1}^{n-1} (2i - 1) = 2\sum_{i=1}^{n-1} i - \sum_{i=1}^{n-1} 1 = 2\,\frac{n(n-1)}{2} - (n-1) = (n-1)^2.$$
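
The recurrence $h_{i,i+1} = 2 + h_{i-1,i}$ with $h_{12} = 1$ is easy to evaluate directly. The following minimal sketch (the function name `path_hitting_time` is an illustrative choice, and n ≥ 2 is assumed) sums the terms and can be compared against the closed form $(n-1)^2$.

```python
def path_hitting_time(n):
    """Expected steps from vertex 1 to vertex n on a path of n >= 2 vertices,
    obtained by summing h_{i,i+1} with h_{1,2} = 1 and h_{i,i+1} = 2 + h_{i-1,i}."""
    h = 1          # h_{1,2}
    total = h
    for _ in range(2, n):
        h = 2 + h  # next term of the recurrence
        total += h
    return total

# Compare with the closed form (n - 1)^2 for a few path lengths.
matches = all(path_hitting_time(n) == (n - 1) ** 2 for n in range(2, 30))
```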


The next lemma shows that the expected time spent at vertex i by a random walk
from vertex 1 to vertex n in a chain of n vertices is 2(n − i) for 2 ≤ i ≤ n − 1.


Lemma 4.8 Consider a random walk from vertex 1 to vertex n in a chain of n vertices.
Let t(i) be the expected time spent at vertex i. Then

$$t(i) = \begin{cases} n - 1 & i = 1 \\ 2(n - i) & 2 \le i \le n - 1 \\ 1 & i = n. \end{cases}$$


Proof: Now t(n) = 1 since the walk stops when it reaches vertex n. Half of the time when
the walk is at vertex n − 1 it goes to vertex n. Thus t(n − 1) = 2. For 3 ≤ i < n − 1,
$t(i) = \frac{1}{2}[t(i-1) + t(i+1)]$, and t(1) and t(2) satisfy $t(1) = \frac{1}{2}t(2) + 1$ and
$t(2) = t(1) + \frac{1}{2}t(3)$. Solving for t(i + 1) for 3 ≤ i < n − 1 yields

$$t(i+1) = 2t(i) - t(i-1)$$

which has solution t(i) = 2(n − i) for 3 ≤ i < n − 1. Then solving for t(2) and t(1) yields
t(2) = 2(n − 2) and t(1) = n − 1. Thus, the total time spent at vertices is

$$n - 1 + 2(1 + 2 + \cdots + (n-2)) + 1 = (n-1) + 2\,\frac{(n-1)(n-2)}{2} + 1 = (n-1)^2 + 1$$

which is one more than $h_{1n}$ and thus is correct. ∎
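
The closed form for t(i) can be checked against the balance equations it is supposed to satisfy: the expected number of visits to a vertex equals the expected number of arrivals from its neighbors, plus one for the start vertex. The sketch below is illustrative; the function names are hypothetical and n ≥ 4 is assumed so that vertices 2 and n − 1 are distinct interior vertices.

```python
from fractions import Fraction

def visits_closed_form(n):
    """t(i) for i = 1..n as stated in the lemma (index 0 is unused)."""
    t = [None] * (n + 1)
    t[1] = Fraction(n - 1)
    for i in range(2, n):
        t[i] = Fraction(2 * (n - i))
    t[n] = Fraction(1)
    return t

def satisfies_balance_equations(n):
    """Check the visit-count equations for a path of n >= 4 vertices."""
    t = visits_closed_form(n)
    half = Fraction(1, 2)
    ok = t[1] == half * t[2] + 1            # the start, plus arrivals from vertex 2
    ok = ok and t[2] == t[1] + half * t[3]  # vertex 1 always moves to vertex 2
    for i in range(3, n - 1):
        ok = ok and t[i] == half * (t[i - 1] + t[i + 1])
    ok = ok and t[n - 1] == half * t[n - 2]  # vertex n never steps back
    ok = ok and t[n] == half * t[n - 1]
    return ok

def total_time(n):
    """Sum of t(i) over all vertices; should equal (n - 1)^2 + 1."""
    return sum(visits_closed_form(n)[1:])
```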


Figure 4.8: Illustration that adding edges to a graph can either increase or decrease
hitting time. The figure shows a clique of size n/2 attached at vertex x to a path ending at vertex y.


Adding edges to a graph might either increase or decrease the hitting time $h_{xy}$. Consider
the graph consisting of a single path of n vertices. Add edges to this graph to get the
graph in Figure 4.8 consisting of a clique of size n/2 connected to a path of n/2 vertices.
Then add still more edges to get a clique of size n. Let x be the vertex at the midpoint of
the original path and let y be the other endpoint of the path consisting of n/2 vertices as
shown in the figure. In the first graph consisting of a single path of length n, $h_{xy} = \Theta(n^2)$.
In the second graph consisting of a clique of size n/2 along with a path of length n/2,
$h_{xy} = \Theta(n^3)$. To see this latter statement, note that starting at x, the walk will go down
the path towards y and return to x for n/2 − 1 times on average before reaching y for the
first time, by Lemma 4.8. Each time the walk in the path returns to x, with probability
(n/2 − 1)/(n/2) it enters the clique and thus on average enters the clique $\Theta(n)$ times
before starting down the path again. Each time it enters the clique, it spends $\Theta(n)$ time
in the clique before returning to x. It then reenters the clique $\Theta(n)$ times before starting
down the path to y. Thus, each time the walk returns to x from the path it spends $\Theta(n^2)$
time in the clique before starting down the path towards y for a total expected time that
is $\Theta(n^3)$ before reaching y. In the third graph, which is the clique of size n, $h_{xy} = \Theta(n)$.
Thus, adding edges first increased $h_{xy}$ from $n^2$ to $n^3$ and then decreased it to n.


Hitting time is not symmetric even in the case of undirected graphs. In the graph of
Figure 4.8, the expected time, $h_{xy}$, of a random walk from x to y, where x is the vertex of
attachment and y is the other end vertex of the chain, is $\Theta(n^3)$. However, $h_{yx}$ is $\Theta(n^2)$.


Commute time 


The commute time, commute(x, y), is the expected time of a random walk starting at
x reaching y and then returning to x. So commute(x, y) = $h_{xy} + h_{yx}$. Think of going
from home to office and returning home. Note that commute time is symmetric. We now
relate the commute time to an electrical quantity, the effective resistance. The effective
resistance between two vertices x and y in an electrical network is the voltage difference
between x and y when one unit of current is inserted at vertex x and withdrawn from
vertex y.


Theorem 4.9 Given a connected, undirected graph, consider the electrical network where
each edge of the graph is replaced by a one ohm resistor. Given vertices x and y, the
commute time, commute(x, y), equals $2m r_{xy}$ where $r_{xy}$ is the effective resistance from x
to y and m is the number of edges in the graph.


Proof: Insert at each vertex i a current equal to the degree $d_i$ of vertex i. The total
current inserted is 2m where m is the number of edges. Extract from a specific vertex j
all of this 2m current (note: for this to be legal, the graph must be connected). Let $v_{ij}$
be the voltage difference from i to j. The current into i divides into the $d_i$ resistors at
vertex i. The current in each resistor is proportional to the voltage across it. Let k be a
vertex adjacent to i. Then the current through the resistor between i and k is $v_{ij} - v_{kj}$,
the voltage drop across the resistor. The sum of the currents out of i through the resistors
must equal $d_i$, the current injected into i.

$$d_i = \sum_{k \text{ adj to } i} (v_{ij} - v_{kj}) = d_i v_{ij} - \sum_{k \text{ adj to } i} v_{kj}.$$

Solving for $v_{ij}$,

$$v_{ij} = 1 + \sum_{k \text{ adj to } i} \frac{1}{d_i} v_{kj} = \sum_{k \text{ adj to } i} \frac{1}{d_i}(1 + v_{kj}). \qquad (4.11)$$

Now the hitting time from i to j is the average time over all paths from i to k adjacent
to i and then on from k to j. This is given by

$$h_{ij} = \sum_{k \text{ adj to } i} \frac{1}{d_i}(1 + h_{kj}). \qquad (4.12)$$

Subtracting (4.12) from (4.11) gives $v_{ij} - h_{ij} = \sum_{k \text{ adj to } i} \frac{1}{d_i}(v_{kj} - h_{kj})$. Thus, the function of
i, $v_{ij} - h_{ij}$, is harmonic. Designate vertex j as the only boundary vertex. The value of
$v_{ij} - h_{ij}$ at i = j, namely $v_{jj} - h_{jj}$, is zero, since both $v_{jj}$ and $h_{jj}$ are zero. So the function
$v_{ij} - h_{ij}$ must be zero everywhere. Thus, the voltage $v_{ij}$ equals the expected time $h_{ij}$ from
i to j.

To complete the proof of Theorem 4.9, note that $h_{ij} = v_{ij}$ is the voltage from i to j
when currents are inserted at all vertices in the graph and extracted at vertex j. If the
current is extracted from i instead of j, then the voltages change and $v_{ji} = h_{ji}$ in the new
setup. Finally, reverse all currents in this latter step. The voltages change again and for
the new voltages $-v_{ji} = h_{ji}$. Since $-v_{ji} = v_{ij}$, we get $h_{ji} = v_{ij}$.


Figure 4.9: Illustration of proof that commute(x, y) = $2m r_{xy}$ where m is the number of
edges in the undirected graph and $r_{xy}$ is the effective resistance between x and y. Panels:
(a) Insert current at each vertex equal to degree of the vertex. Extract 2m at vertex j, $v_{ij} = h_{ij}$.
(b) Extract current from i instead of j. For new voltages $v_{ji} = h_{ji}$.
(c) Reverse currents in (b). For new voltages $-v_{ji} = h_{ji}$. Since $-v_{ji} = v_{ij}$, $h_{ji} = v_{ij}$.
(d) Superpose currents in (a) and (c). $2m r_{ij} = v_{ij} = h_{ij} + h_{ji}$ = commute(i, j).


Thus, when a current is inserted at each vertex equal to the degree of the vertex and
the current is extracted from j, the voltage $v_{ij}$ in this set up equals $h_{ij}$. When we extract
the current from i instead of j and then reverse all currents, the voltage $v_{ij}$ in this new set
up equals $h_{ji}$. Now, superpose both situations, i.e., add all the currents and voltages. By
linearity, the resulting $v_{ij}$, which is the sum of the other two $v_{ij}$'s, is $v_{ij} = h_{ij} + h_{ji}$. All
currents into or out of the network cancel except the 2m amps injected at i and withdrawn
at j. Thus, $2m r_{ij} = v_{ij} = h_{ij} + h_{ji}$ = commute(i, j) or commute(i, j) = $2m r_{ij}$ where $r_{ij}$
is the effective resistance from i to j. ∎
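
The identity commute(x, y) = $2m r_{xy}$ can be verified exactly on a small graph by solving two linear systems: one for hitting times and one for the voltages produced by a unit current. The sketch below is an illustration under assumptions made here; the helper names `solve`, `hitting_time`, and `effective_resistance` and the four-vertex example are hypothetical, and all edges are unit resistors.

```python
from fractions import Fraction

def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination over Fractions."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [row[n] for row in M]

def hitting_time(adj, src, dst):
    """Expected number of steps for the walk started at src to first reach dst."""
    verts = sorted(adj)
    idx = {x: i for i, x in enumerate(verts)}
    A, rhs = [], []
    for x in verts:
        row = [Fraction(0)] * len(verts)
        row[idx[x]] = Fraction(1)
        if x == dst:
            rhs.append(0)                     # h_{dst,dst} = 0
        else:
            for k in adj[x]:                  # h_x = 1 + (1/d_x) * sum_k h_k
                row[idx[k]] -= Fraction(1, len(adj[x]))
            rhs.append(1)
        A.append(row)
    return solve(A, rhs)[idx[src]]

def effective_resistance(adj, x, y):
    """Voltage at x when one amp enters at x and leaves at y, with y grounded."""
    verts = sorted(adj)
    idx = {z: i for i, z in enumerate(verts)}
    A, rhs = [], []
    for z in verts:
        row = [Fraction(0)] * len(verts)
        if z == y:
            row[idx[z]] = Fraction(1)         # ground: v_y = 0
            rhs.append(0)
        else:
            row[idx[z]] = Fraction(len(adj[z]))
            for k in adj[z]:                  # net current out of z through unit resistors
                row[idx[k]] -= 1
            rhs.append(1 if z == x else 0)
        A.append(row)
    return solve(A, rhs)[idx[x]]

# Hypothetical example: a triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
m = sum(len(nbrs) for nbrs in adj.values()) // 2
commute = hitting_time(adj, 0, 3) + hitting_time(adj, 3, 0)
```

In this example the two hitting times differ (9 steps from vertex 0 to vertex 3 versus 13/3 in the other direction), yet their sum equals $2m r_{03} = 2 \cdot 4 \cdot \frac{5}{3}$.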


The following corollary follows from Theorem 4.9 since the effective resistance $r_{uv}$ is
less than or equal to one when u and v are connected by an edge.


Corollary 4.10 If vertices x and y are connected by an edge, then $h_{xy} + h_{yx} \le 2m$ where
m is the number of edges in the graph.


Proof: If x and y are connected by an edge, then the effective resistance $r_{xy}$ is less than
or equal to one. ∎

Corollary 4.11 For vertices x and y in an n vertex graph, the commute time, commute(x, y),
is less than or equal to $n^3$.


Proof: By Theorem 4.9 the commute time is given by the formula commute(x, y) =
$2m r_{xy}$ where m is the number of edges. In an n vertex graph there exists a path from x
to y of length at most n. Since the resistance can not be greater than that of any path
from x to y, $r_{xy} \le n$. Since the number of edges is at most $\binom{n}{2}$,

$$\text{commute}(x, y) = 2m r_{xy} \le 2\binom{n}{2} n \le n^3.$$


While adding edges into a graph can never increase the effective resistance between 
two given nodes x and y, it may increase or decrease the commute time. To see this 
consider three graphs: the graph consisting of a chain of n vertices, the graph of Figure 
4.8, and the clique on n vertices. 


Cover time 


The cover time, cover(x, G), is the expected time of a random walk starting at vertex x
in the graph G to reach each vertex at least once. We write cover(x) when G is understood.
The cover time of an undirected graph G, denoted cover(G), is

$$\text{cover}(G) = \max_x \text{cover}(x, G).$$

For cover time of an undirected graph, increasing the number of edges in the graph
may increase or decrease the cover time depending on the situation. Again consider three
graphs, a chain of length n which has cover time $\Theta(n^2)$, the graph in Figure 4.8 which has
cover time $\Theta(n^3)$, and the complete graph on n vertices which has cover time $\Theta(n \log n)$.
Adding edges to the chain of length n to create the graph in Figure 4.8 increases the
cover time from $n^2$ to $n^3$ and then adding even more edges to obtain the complete graph
reduces the cover time to $n \log n$.


Note: The cover time of a clique is $\Theta(n \log n)$ since this is the time to select every
integer out of n integers with high probability, drawing integers at random. This is called
the coupon collector problem. The cover time for a straight line is $\Theta(n^2)$ since it is the
same as the hitting time. For the graph in Figure 4.8, the cover time is $\Theta(n^3)$ since one
takes the maximum over all start states and cover(x, G) = $\Theta(n^3)$ where x is the vertex
of attachment.
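
For the clique, the coupon collector argument gives an exact expectation. Assuming a clique without self loops (an assumption made for this sketch), when k vertices are still unvisited the next step reaches a new one with probability k/(n − 1), so the expected cover time is $(n-1)\sum_{k=1}^{n-1} \frac{1}{k}$, which grows like $n \ln n$. The function name `clique_cover_time` is illustrative.

```python
from fractions import Fraction

def clique_cover_time(n):
    """Exact expected cover time of the n-clique without self loops, from any start vertex.

    With k vertices still unvisited, a step lands on a new vertex with probability
    k / (n - 1), so the expected wait for the next new vertex is (n - 1) / k.
    """
    return sum(Fraction(n - 1, k) for k in range(1, n))
```

For n = 3 this gives one step to reach a second vertex plus an expected two steps to reach the third, for a total of 3.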


Theorem 4.12 Let G be a connected graph with n vertices and m edges. The time for a 
random walk to cover all vertices of the graph G is bounded above by 4m(n — 1). 




Proof: Consider a depth first search of the graph G starting from some vertex z and let
T be the resulting depth first search spanning tree of G. The depth first search covers
every vertex. Consider the expected time to cover every vertex in the order visited by the
depth first search. Clearly this bounds the cover time of G starting from vertex z. Note
that each edge in T is traversed twice, once in each direction.

$$\text{cover}(z, G) \le \sum_{\substack{(x,y) \in T \\ (y,x) \in T}} h_{xy}.$$

If (x, y) is an edge in T, then x and y are adjacent and thus Corollary 4.10 implies
$h_{xy} \le 2m$. Since there are n − 1 edges in the dfs tree and each edge is traversed twice,
once in each direction, cover(z) ≤ 4m(n − 1). This holds for all starting vertices z. Thus,
cover(G) ≤ 4m(n − 1). ∎


The theorem gives the correct answer of $n^3$ for the n/2 clique with the n/2 tail. It
gives an upper bound of $n^3$ for the n-clique where the actual cover time is $n \log n$.


Let $r_{xy}$ be the effective resistance from x to y. Define the resistance $r_{\text{eff}}(G)$ of a graph
G by $r_{\text{eff}}(G) = \max_{x,y}(r_{xy})$.


Theorem 4.13 Let G be an undirected graph with m edges. Then the cover time for G
is bounded by the following inequality

$$m\, r_{\text{eff}}(G) \le \text{cover}(G) \le 6e\, m\, r_{\text{eff}}(G) \ln n + n$$

where $e \approx 2.718$ is Euler's number and $r_{\text{eff}}(G)$ is the resistance of G.


Proof: By definition $r_{\text{eff}}(G) = \max_{x,y}(r_{xy})$. Let u and v be the vertices of G for which
$r_{xy}$ is maximum. Then $r_{\text{eff}}(G) = r_{uv}$. By Theorem 4.9, commute(u, v) = $2m r_{uv}$. Hence
$m r_{uv} = \frac{1}{2}$commute(u, v). Note that $\frac{1}{2}$commute(u, v) is the average of $h_{uv}$ and $h_{vu}$, which
is clearly less than or equal to $\max(h_{uv}, h_{vu})$. Finally, $\max(h_{uv}, h_{vu})$ is less than or equal
to max(cover(u, G), cover(v, G)) which is at most the cover time of G. Putting
these facts together gives the first inequality in the theorem.

$$m\, r_{\text{eff}}(G) = m\, r_{uv} = \frac{1}{2}\text{commute}(u, v) \le \max(h_{uv}, h_{vu}) \le \text{cover}(G)$$


For the second inequality in the theorem, by Theorem 4.9, for any x and y, commute(x, y)
equals $2m r_{xy}$ which is less than or equal to $2m r_{\text{eff}}(G)$, implying $h_{xy} \le 2m r_{\text{eff}}(G)$. By
the Markov inequality, since the expected time to reach y starting at any x is less than
$2m r_{\text{eff}}(G)$, the probability that y is not reached from x in $2e\, m\, r_{\text{eff}}(G)$ steps is at most
$\frac{1}{e}$. Thus, the probability that a vertex y has not been reached in $6e\, m\, r_{\text{eff}}(G) \ln n$ steps
is at most $\left(\frac{1}{e}\right)^{3\ln n} = \frac{1}{n^3}$ because a random walk of length $6e\, m\, r_{\text{eff}}(G) \ln n$ is a sequence of
$3 \ln n$ random walks, each of length $2e\, m\, r_{\text{eff}}(G)$ and each possibly starting from different
vertices. Suppose after a walk of $6e\, m\, r_{\text{eff}}(G) \ln n$ steps, vertices $v_1, v_2, \ldots, v_l$ had not
been reached. Walk until $v_1$ is reached, then $v_2$, etc. By Corollary 4.11 the expected time
for each of these is $n^3$, but since each happens only with probability $1/n^3$, we effectively
take O(1) time per $v_i$, for a total time at most n. More precisely,

$$\text{cover}(G) \le 6e\, m\, r_{\text{eff}}(G) \ln n + \sum_v \text{Prob}\left(v \text{ was not visited in the first } 6e\, m\, r_{\text{eff}}(G) \ln n \text{ steps}\right) n^3$$

$$\le 6e\, m\, r_{\text{eff}}(G) \ln n + \sum_v \frac{1}{n^3}\, n^3 \le 6e\, m\, r_{\text{eff}}(G) \ln n + n. \qquad ∎$$





4.7 Random Walks in Euclidean Space 


Many physical processes such as Brownian motion are modeled by random walks. 
Random walks in Euclidean d-space consisting of fixed length steps parallel to the co- 
ordinate axes are really random walks on a d-dimensional lattice and are a special case 
of random walks on graphs. In a random walk on a graph, at each time unit an edge 
from the current vertex is selected at random and the walk proceeds to the adjacent vertex. 


Random walks on lattices 


We now apply the analogy between random walks and current to lattices. Consider 
a random walk on a finite segment —n,...,—1,0,1,2,...,n of a one dimensional lattice 
starting from the origin. Is the walk certain to return to the origin or is there some prob- 
ability that it will escape, i.e., reach the boundary before returning? The probability of 
reaching the boundary before returning to the origin is called the escape probability. We 
shall be interested in this quantity as n goes to infinity. 


Convert the lattice to an electrical network by replacing each edge with a one ohm
resistor. Then the probability of a walk starting at the origin reaching n or −n before
returning to the origin is the escape probability given by

$$p_{\text{escape}} = \frac{c_{\text{eff}}}{c_a}$$

where $c_{\text{eff}}$ is the effective conductance between the origin and the boundary points and $c_a$
is the sum of the conductances at the origin. In a d-dimensional lattice, $c_a = 2d$ assuming
that the resistors have value one. For the d-dimensional lattice

$$p_{\text{escape}} = \frac{1}{2d\, r_{\text{eff}}}.$$

In one dimension, the electrical network is just two series connections of n one-ohm
resistors connected in parallel. So as n goes to infinity, $r_{\text{eff}}$ goes to infinity and the escape
probability goes to zero. Thus, the walk in the unbounded one
dimensional lattice will return to the origin with probability one. Note, however, that
the expected time to return to the origin having taken one step away, which is equal to
commute(1, 0), is infinite (Theorem 4.9).


Figure 4.10: 2-dimensional lattice along with the linear network resulting from shorting
resistors on the concentric squares about the origin.


Two dimensions 


For the 2-dimensional lattice, consider a larger and larger square about the origin for
the boundary as shown in Figure 4.10a and consider the limit of $r_{\text{eff}}$ as the squares get
larger. Shorting the resistors on each square can only reduce $r_{\text{eff}}$. Shorting the resistors
results in the linear network shown in Figure 4.10b. As the paths get longer, the number
of resistors in parallel also increases. The resistance between vertex i and i + 1 is really
4(2i + 1) unit resistors in parallel. The effective resistance of 4(2i + 1) resistors in parallel
is 1/4(2i + 1). Thus,

$$r_{\text{eff}} \ge \frac{1}{4} + \frac{1}{12} + \frac{1}{20} + \cdots = \frac{1}{4}\left(1 + \frac{1}{3} + \frac{1}{5} + \cdots\right) = \Theta(\ln n).$$

Since the lower bound on the effective resistance and hence the effective resistance goes
to infinity, the escape probability goes to zero for the 2-dimensional lattice.
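
The logarithmic growth of the lower bound can be seen numerically. The sketch below sums $\frac{1}{4(2i+1)}$ over the first n shorted squares and converts the result into an upper bound on the escape probability using $p_{\text{escape}} = \frac{1}{2d\, r_{\text{eff}}}$ with d = 2. The function names are illustrative, not from the text.

```python
import math

def shorted_lattice_lower_bound(n):
    """Sum of 1/(4(2i+1)) for i = 0..n-1: resistance of the shorted linear network."""
    return sum(1.0 / (4 * (2 * i + 1)) for i in range(n))

def escape_upper_bound(n):
    """Upper bound on the 2-dimensional escape probability, 1 / (2 d r_eff) with d = 2."""
    return 1.0 / (4.0 * shorted_lattice_lower_bound(n))

# Increasing n by a factor of 100 raises the bound by about ln(100)/8.
growth = shorted_lattice_lower_bound(10_000) - shorted_lattice_lower_bound(100)
expected_growth = math.log(100) / 8
```

The bound on the escape probability therefore decays only like $1/\ln n$, which is slow but still tends to zero.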


Three dimensions 


In three dimensions, the resistance along any path to infinity grows to infinity but
the number of paths in parallel also grows to infinity. It turns out there are a sufficient
number of paths that $r_{\text{eff}}$ remains finite and thus there is a non-zero escape probability.
We will prove this now. First note that shorting any edge decreases the resistance, so
we do not use shorting in this proof, since we seek to prove an upper bound on the
resistance. Instead we remove some edges, which increases their resistance to infinity and
hence increases the effective resistance, giving an upper bound. To simplify things we
consider walks on a quadrant rather than the full grid. The resistance to infinity derived
from only the quadrant is an upper bound on the resistance of the full grid.


Figure 4.11: Paths in a 2-dimensional lattice obtained from the 3-dimensional construction
applied in 2-dimensions.


The construction used in three dimensions is easier to explain first in two dimensions;
see Figure 4.11. Draw dotted diagonal lines at $x + y = 2^n - 1$. Consider two paths
that start at the origin. One goes up and the other goes to the right. Each time a path 
encounters a dotted diagonal line, split the path into two, one which goes right and the 
other up. Where two paths cross, split the vertex into two, keeping the paths separate. By 
a symmetry argument, splitting the vertex does not change the resistance of the network. 
Remove all resistors except those on these paths. The resistance of the original network is 
less than that of the tree produced by this process since removing a resistor is equivalent 
to increasing its resistance to infinity. 


The distances between splits increase and are 1, 2, 4, etc. At each split the number
of paths in parallel doubles. See Figure 4.12. Thus, the resistance to infinity in this two
dimensional example is

$$\frac{1}{2} + \frac{1}{4}\cdot 2 + \frac{1}{8}\cdot 4 + \cdots = \frac{1}{2} + \frac{1}{2} + \frac{1}{2} + \cdots = \infty.$$


Figure 4.12: Paths obtained from 2-dimensional lattice. Distances between splits double
as do the number of parallel paths.


In the analogous three dimensional construction, paths go up, to the right, and out of
the plane of the paper. The paths split three ways at planes given by $x + y + z = 2^n - 1$.
Each time the paths split the number of parallel segments triples. Segments of the paths
between splits are of length 1, 2, 4, etc. and the resistance of each segment is equal to
its length. The resistance out to infinity for the tree is

$$\frac{1}{3} + \frac{1}{9}\cdot 2 + \frac{1}{27}\cdot 4 + \cdots = \frac{1}{3}\left(1 + \frac{2}{3} + \frac{4}{9} + \cdots\right) = \frac{1}{3}\cdot\frac{1}{1 - \frac{2}{3}} = 1.$$


The resistance of the three dimensional lattice is less. It is important to check that the
paths are edge-disjoint and so the tree is a subgraph of the lattice. Going to a subgraph is
equivalent to deleting edges which increases the resistance. That is why the resistance of
the lattice is less than that of the tree. Thus, in three dimensions the escape probability
is non-zero. The upper bound on $r_{eff}$ gives the lower bound

$$p_{escape} = \frac{1}{2d}\,\frac{1}{r_{eff}} \ge \frac{1}{6}.$$


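A lower bound on $r_{eff}$ gives an upper bound on $p_{escape}$. To get the upper bound on
$p_{escape}$, short all resistors on surfaces of boxes at distances 1, 2, 3, etc. Then

$$r_{eff} \ge \frac{1}{6}\left[1 + \frac{1}{9} + \frac{1}{25} + \cdots\right] \ge \frac{1.23}{6} \ge 0.2.$$

This gives

$$p_{escape} = \frac{1}{2d}\,\frac{1}{r_{eff}} \le \frac{5}{6}.$$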
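
The two series behind these bounds are easy to evaluate numerically. The sketch below is illustrative only; the function names `tree_resistance` and `shorted_box_bound` are assumptions introduced here, and the sums are truncated rather than taken to infinity.

```python
import math


def tree_resistance(levels):
    """Resistance of the 3-dimensional tree truncated after `levels` splits.

    Level k has 3^(k+1) parallel segments, each of resistance 2^k.
    """
    return sum(2 ** k / 3 ** (k + 1) for k in range(levels))


def shorted_box_bound(terms):
    """Partial sum of (1/6) * (1 + 1/9 + 1/25 + ...), the shorted-box lower bound."""
    return sum(1.0 / (2 * k + 1) ** 2 for k in range(terms)) / 6


if __name__ == "__main__":
    d = 3
    upper_r = tree_resistance(200)      # approaches 1
    lower_r = shorted_box_bound(10000)  # approaches (pi^2 / 8) / 6
    print("r_eff between", round(lower_r, 4), "and", round(upper_r, 4))
    print("p_escape between", round(1 / (2 * d * upper_r), 4),
          "and", round(1 / (2 * d * lower_r), 4))
```

Because the text rounds $1.23/6$ down to $0.2$, the upper bound on $p_{escape}$ printed by this sketch is slightly tighter than $5/6$.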


4.8 The Web as a Markov Chain 


A modern application of random walks on directed graphs comes from trying to establish
the importance of pages on the World Wide Web. Search engines output an ordered
list of webpages in response to each search query. To do this, they have to solve two
problems at query time: (i) find the set of all webpages containing the query term(s) and
(ii) rank the webpages and display them (or the top subset of them) in ranked order. (i)
is done by maintaining a “reverse index” which we do not discuss here. (ii) cannot be
done at query time since this would make the response too slow. So search engines rank
the entire set of webpages (in the billions) “off-line” and use that single ranking for all
queries. At query time, the webpages containing the query term(s) are displayed in this
ranked order.


Figure 4.13: Impact on pagerank of adding a self loop

One way to do this ranking would be to take a random walk on the web viewed as a 
directed graph (which we call the web graph) with an edge corresponding to each hyper- 
text link and rank pages according to their stationary probability. Hypertext links are 
one-way and the web graph may not be strongly connected. Indeed, for a node at the
“bottom” level there may be no out-edges. When the walk encounters this vertex the
walk disappears. Another difficulty is that a vertex or a strongly connected component
with no in-edges is never reached. One way to resolve these difficulties is to introduce
a random restart condition. At each step, with some probability $r$, jump to a vertex selected
uniformly at random in the entire graph; with probability $1 - r$ select an out-edge
at random from the current node and follow it. If a vertex has no out-edges, the value
of $r$ for that vertex is set to one. This makes the graph strongly connected so that the
stationary probabilities exist. 
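
The restart rule can be written down as a transition matrix. The following is a minimal sketch under stated assumptions: the graph is given as hypothetical out-edge lists over vertices $0, \ldots, n-1$, and the names `restart_transition_matrix` and `stationary` are illustrative, not taken from the text.

```python
def restart_transition_matrix(out_edges, r=0.15):
    """Row-stochastic matrix of the walk with random restarts.

    out_edges[u] lists the vertices that u links to (vertices are 0..n-1).
    A vertex with no out-edges restarts with probability one.
    """
    n = len(out_edges)
    P = []
    for targets in out_edges:
        restart = r if targets else 1.0
        row = [restart / n] * n
        for v in targets:
            row[v] += (1.0 - restart) / len(targets)
        P.append(row)
    return P


def stationary(P, iterations=200):
    """Approximate the stationary distribution by repeating p <- pP."""
    n = len(P)
    p = [1.0 / n] * n
    for _ in range(iterations):
        p = [sum(p[u] * P[u][v] for u in range(n)) for v in range(n)]
    return p


if __name__ == "__main__":
    # Vertex 3 has no in-edges and vertex 2 has no out-edges.
    web = [[1, 2], [2], [], [0]]
    ranks = stationary(restart_transition_matrix(web))
    print([round(x, 4) for x in ranks])
```

Every entry of the matrix is at least $r/n$, so every vertex is reachable from every other in one step, which is the strong connectivity the restart is meant to provide.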


Pagerank 


The pagerank of a vertex in a directed graph is the stationary probability of the vertex, 
where we assume a positive restart probability of say r = 0.15. The restart ensures that 
the graph is strongly connected. The pagerank of a page is the frequency with which the 
page will be visited over a long period of time. If the pagerank is p, then the expected 
time between visits or return time is 1/p. Notice that one can increase the pagerank of a 
page by reducing the return time and this can be done by creating short cycles. 


Consider a vertex $i$ with a single edge in from vertex $j$ and a single edge out. The
stationary probability $\pi$ satisfies $\pi P = \pi$, and thus

$$\pi_i = \pi_j p_{ji}.$$

Adding a self-loop at $i$ results in a new equation

$$\pi_i = \pi_j p_{ji} + \frac{1}{2}\pi_i$$

or

$$\pi_i = 2\pi_j p_{ji}.$$

Of course, $\pi_j$ would have changed too, but ignoring this for now, pagerank is doubled by
the addition of a self-loop. Adding $k$ self-loops results in the equation

$$\pi_i = \pi_j p_{ji} + \frac{k}{k+1}\pi_i$$

and again ignoring the change in $\pi_j$, we now have $\pi_i = (k + 1)\pi_j p_{ji}$. What prevents
one from increasing the pagerank of a page arbitrarily? The answer is the restart. We
neglected the 0.15 probability that is taken off for the random restart. With the restart
taken into account, the equation for $\pi_i$ when there is no self-loop is

$$\pi_i = 0.85\,\pi_j p_{ji}$$

whereas, with $k$ self-loops, the equation is

$$\pi_i = 0.85\,\pi_j p_{ji} + 0.85\,\frac{k}{k+1}\pi_i.$$

Solving for $\pi_i$ yields

$$\pi_i = \frac{0.85k + 0.85}{0.15k + 1}\,\pi_j p_{ji}$$

which for $k = 1$ is $\pi_i = 1.48\,\pi_j p_{ji}$ and in the limit as $k \to \infty$ is $\pi_i = 5.67\,\pi_j p_{ji}$. Adding a
single loop only increases pagerank by a factor of 1.74.
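
The effect of the restart on self-loops can be tabulated directly. This sketch solves the equation for $k$ self-loops while holding $\pi_j$ fixed, as the argument above does; the function name `self_loop_multiplier` is an assumption made for illustration.

```python
def self_loop_multiplier(k, r=0.15):
    """Coefficient c in pi_i = c * pi_j * p_ji for a vertex with k self-loops.

    Solves pi_i = (1 - r) * pi_j * p_ji + (1 - r) * (k / (k + 1)) * pi_i,
    treating pi_j as unchanged.
    """
    keep = 1.0 - r
    return keep / (1.0 - keep * k / (k + 1))


if __name__ == "__main__":
    base = self_loop_multiplier(0)
    for k in (0, 1, 2, 10, 1000):
        c = self_loop_multiplier(k)
        print(k, round(c, 3), "gain over no loop:", round(c / base, 3))
```

However many loops are added, the coefficient stays below $(1 - r)/r$, so a larger restart probability caps the gain more tightly.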


Relation to Hitting time 


Recall the definition of hitting time $h_{xy}$, which for two states $x$ and $y$ is the expected
time to reach $y$ starting from $x$. Here, we deal with $h_y$, the average time to hit $y$, starting
at a random node. Namely, $h_y = \frac{1}{n}\sum_x h_{xy}$, where the sum is taken over all $n$ nodes $x$.
Hitting time $h_y$ is closely related to return time and thus to the reciprocal of pagerank.
Return time is clearly less than the expected time until a restart plus hitting time. With
$r$ as the restart value, this gives:

$$\text{Return time to } y \le \frac{1}{r} + h_y.$$

In the other direction, the fastest one could return would be if there were only paths of
length two (assume we remove all self-loops). A path of length two would be traversed
with at most probability $(1 - r)^2$. With probability $r + (1 - r)r = (2 - r)r$ one restarts
and then hits $v$. Thus, the return time is at least $2(1 - r)^2 + (2 - r)r \times (\text{hitting time})$.
Combining these two bounds yields

$$2(1 - r)^2 + (2 - r)r\,(\text{hitting time}) \le (\text{return time}) \le \frac{1}{r} + (\text{hitting time}).$$


The relationship between return time and hitting time can be used to see if a vertex has 
unusually high probability of short loops. However, there is no efficient way to compute 
hitting time for all vertices as there is for return time. For a single vertex v, one can 
compute hitting time by removing the edges out of the vertex v for which one is com- 
puting hitting time and then run the pagerank algorithm for the new graph. The hitting 
time for v is the reciprocal of the pagerank in the graph with the edges out of v removed. 
Since computing hitting time for each vertex requires removal of a different set of edges, 
the algorithm only gives the hitting time for one vertex at a time. Since one is probably 
only interested in the hitting time of vertices with low hitting time, an alternative would 
be to use a random walk to estimate the hitting time of low hitting time vertices. 
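
The single-vertex procedure can be tried on a toy graph. The sketch below is only an illustration under stated assumptions: the graph, the helper names `restart_matrix`, `pagerank`, and `average_hitting_time`, and the convention $h_{vv} = 0$ are all introduced here. It computes $h_v$ directly from the hitting-time equations and compares it with the reciprocal of the pagerank of $v$ after the out-edges of $v$ are removed.

```python
def restart_matrix(out_edges, r):
    """Transition matrix with restart probability r (1.0 at vertices with no out-edges)."""
    n = len(out_edges)
    P = []
    for targets in out_edges:
        restart = r if targets else 1.0
        row = [restart / n] * n
        for v in targets:
            row[v] += (1.0 - restart) / len(targets)
        P.append(row)
    return P


def pagerank(P, iterations=2000):
    """Stationary distribution by repeating p <- pP."""
    n = len(P)
    p = [1.0 / n] * n
    for _ in range(iterations):
        p = [sum(p[u] * P[u][v] for u in range(n)) for v in range(n)]
    return p


def average_hitting_time(P, v, iterations=2000):
    """h_v = (1/n) * sum_x h_xv, with h_vv taken to be 0 (an assumed convention).

    Iterates h_x <- 1 + sum_y P[x][y] * h_y for x != v until it settles.
    """
    n = len(P)
    h = [0.0] * n
    for _ in range(iterations):
        h = [0.0 if x == v else 1.0 + sum(P[x][y] * h[y] for y in range(n))
             for x in range(n)]
    return sum(h) / n


if __name__ == "__main__":
    r, v = 0.15, 2
    web = [[1, 2], [2, 3], [0], [0]]
    direct = average_hitting_time(restart_matrix(web, r), v)

    # Remove the edges out of v, so v restarts with probability one.
    cut = [targets if u != v else [] for u, targets in enumerate(web)]
    via_pagerank = 1.0 / pagerank(restart_matrix(cut, r))[v]
    print(round(direct, 4), round(via_pagerank, 4))
```

Under the conventions assumed here, the reciprocal pagerank comes out as $h_v + 1$, the extra step being the restart out of $v$ itself.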


Spam 


Suppose one has a web page and would like to increase its pagerank by creating other 
web pages with pointers to the original page. The abstract problem is the following. We 
are given a directed graph G and a vertex v whose pagerank we want to increase. We may 
add new vertices to the graph and edges from them to any vertices we want. We can also 
add or delete edges from v. However, we cannot add or delete edges out of other vertices. 


The pagerank of v is the stationary probability for vertex v with random restarts. If 
we delete all existing edges out of v, create a new vertex u and edges (v, u) and (u,v), 
then the pagerank will be increased since any time the random walk reaches v it will be 
captured in the loop $v \to u \to v$. A search engine can counter this strategy by more
frequent random restarts. 


A second method to increase pagerank would be to create a star consisting of the 
vertex v at its center along with a large set of new vertices each with a directed edge to 
v. These new vertices will sometimes be chosen as the target of the random restart and 
hence the vertices increase the probability of the random walk reaching v. This second 
method is countered by reducing the frequency of random restarts. 


Notice that the first technique of capturing the random walk increases pagerank but 
does not affect hitting time. One can negate the impact on pagerank of someone capturing
the random walk by increasing the frequency of random restarts. The second technique 
of creating a star increases pagerank due to random restarts and decreases hitting time. 
One can check if the pagerank is high while the hitting time is not correspondingly low, in which case the pagerank
is likely to have been artificially inflated by the page capturing the walk with short cycles.


Personalized pagerank 


In computing pagerank, one uses a restart probability, typically 0.15, in which at each 
step, instead of taking a step in the graph, the walk goes to a vertex selected uniformly 
at random. In personalized pagerank, instead of selecting a vertex uniformly at random, 
one selects a vertex according to a personalized probability distribution. Often the distri- 
bution has probability one for a single vertex and whenever the walk restarts it restarts 
at that vertex. Note that this may make the graph disconnected. 


Algorithm for computing personalized pagerank 


First, consider the normal pagerank. Let $\alpha$ be the restart probability with which the
random walk jumps to an arbitrary vertex. With probability $1 - \alpha$ the random walk
selects a vertex uniformly at random from the set of adjacent vertices. Let $p$ be a row
vector denoting the pagerank and let $A$ be the adjacency matrix with rows normalized to
sum to one. Then

$$p = \frac{\alpha}{n}(1, 1, \ldots, 1) + (1 - \alpha)pA$$

$$p\,[I - (1 - \alpha)A] = \frac{\alpha}{n}(1, 1, \ldots, 1)$$

$$p = \frac{\alpha}{n}(1, 1, \ldots, 1)\,[I - (1 - \alpha)A]^{-1}.$$

Thus, in principle, $p$ can be found by computing the inverse of $[I - (1 - \alpha)A]$. But
this is far from practical since for the whole web one would be dealing with matrices with
billions of rows and columns. A more practical procedure is to run the random walk and
observe using the basics of the power method in Chapter 3 that the process converges to
the solution $p$.


For the personalized pagerank, instead of restarting at an arbitrary vertex, the walk
restarts at a designated vertex. More generally, it may restart in some specified neighborhood.
Suppose the restart selects a vertex using the probability distribution $s$. Then, in
the above calculation replace the vector $\frac{1}{n}(1, 1, \ldots, 1)$ by the vector $s$. Again, the computation
could be done by a random walk. But, we wish to do the random walk calculation
for personalized pagerank quickly since it is to be performed repeatedly. With more care
this can be done, though we do not describe it here.
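
The fixed-point equation with a general restart distribution $s$ can be iterated directly instead of inverting a matrix. The following is a plain iteration sketch, not the faster method the text alludes to; the function name `personalized_pagerank` and the small example graph are assumptions for illustration.

```python
def personalized_pagerank(A, s, alpha=0.15, iterations=200):
    """Iterate p <- alpha * s + (1 - alpha) * p A.

    A is the adjacency matrix with rows normalized to sum to one,
    s is the restart distribution, and alpha is the restart probability.
    """
    n = len(A)
    p = list(s)
    for _ in range(iterations):
        pA = [sum(p[u] * A[u][v] for u in range(n)) for v in range(n)]
        p = [alpha * s[v] + (1.0 - alpha) * pA[v] for v in range(n)]
    return p


if __name__ == "__main__":
    # Directed 4-cycle 0 -> 1 -> 2 -> 3 -> 0; rows are already normalized.
    A = [[0, 1, 0, 0],
         [0, 0, 1, 0],
         [0, 0, 0, 1],
         [1, 0, 0, 0]]
    uniform = [0.25] * 4            # ordinary pagerank
    at_zero = [1.0, 0.0, 0.0, 0.0]  # every restart returns to vertex 0
    print([round(x, 4) for x in personalized_pagerank(A, uniform)])
    print([round(x, 4) for x in personalized_pagerank(A, at_zero)])
```

With $s$ uniform this reduces to the ordinary pagerank; with all restart mass on vertex 0, the probability decays by a factor of $1 - \alpha$ at each step along the cycle away from vertex 0.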


4.9 Bibliographic Notes 


The material on the analogy between random walks on undirected graphs and electrical
networks is from [DS84] as is the material on random walks in Euclidean space. Additional
material on Markov chains can be found in [MR95b], [MU05], and [per10]. For
material on Markov Chain Monte Carlo methods see [Jer98] and [Liu01].


The use of normalized conductance to prove convergence of Markov Chains is by 
Sinclair and Jerrum, [SJ89] and Alon [Alo86]. A polynomial time bounded Markov chain 
based method for estimating the volume of convex sets was developed by Dyer, Frieze and 
Kannan [DFK91].


4.10 Exercises


Exercise 4.1 The Fundamental Theorem of Markov chains says that for a connected
Markov chain, the long-term average distribution $a(t)$ converges to a stationary distribution.
Does the $t$ step distribution $p(t)$ also converge for every connected Markov chain?
Consider the following examples: (i) A two-state chain with $p_{12} = p_{21} = 1$. (ii) A three
state chain with $p_{12} = p_{23} = p_{31} = 1$ and the other $p_{ij} = 0$. Generalize these examples to
produce Markov chains with many states.


Exercise 4.2 Does $\lim_{t \to \infty} a(t) - a(t + 1) = 0$ imply that $a(t)$ converges to some value?
Hint: consider the average cumulative sum of the digits in the sequence $1\,0^2\,1^4\,0^8\,1^{16} \cdots$


Exercise 4.3 What is the stationary probability for the following networks?

[Figure: two Markov chain networks, the first with transition probabilities 0.4 and 0.6, the second with transition probabilities 0.5 and 1.]


Exercise 4.4 A Markov chain is said to be symmetric if for all $i$ and $j$, $p_{ij} = p_{ji}$. What
is the stationary distribution of a connected symmetric chain? Prove your answer.


Exercise 4.5 Prove $|p - q|_1 = 2\sum_i (p_i - q_i)^+$ for probability distributions $p$ and $q$
(Proposition 4.4).


Exercise 4.6 Let $p(x)$, where $x = (x_1, x_2, \ldots, x_d)$, $x_i \in \{0, 1\}$, be a multivariate probability
distribution. For $d = 100$, how would you estimate the marginal distribution

$$p(x_1) = \sum_{x_2, \ldots, x_d} p(x_1, x_2, \ldots, x_d)?$$


Exercise 4.7 Using the Metropolis-Hastings Algorithm create a Markov chain whose stationary
probability is that given in the following table. Use the 3 × 3 lattice for the underlying
graph.

$x_1 x_2$ | 00 | 01 | 02 | 10 | 11 | 12 | 20 | 21 | 22
Prob | 1/16 | 1/8 | 1/16 | 1/8 | 1/4 | 1/8 | 1/16 | 1/8 | 1/16


Exercise 4.8 Using Gibbs sampling create a 4 × 4 lattice where vertices in rows and
columns are cliques whose stationary probability is that given in the following table.

x/y | 1 | 2 | 3 | 4
1 | 1/16 | 1/32 | 1/32 | 1/16
2 | 1/32 | 1/8 | 1/8 | 1/32
3 | 1/32 | 1/8 | 1/8 | 1/32
4 | 1/16 | 1/32 | 1/32 | 1/16

Note by symmetry there are only three types of vertices and only two types of rows or
columns.


Exercise 4.9 How would you integrate a high dimensional multivariate polynomial dis- 
tribution over some convex region? 


Exercise 4.10 Given a time-reversible Markov chain, modify the chain as follows. At 
the current state, stay put (no move) with probability 1/2. With the other probability 1/2, 
move as in the old chain. Show that the new chain has the same stationary distribution. 
What happens to the convergence time in this modification? 


Exercise 4.11 Let $p$ be a probability vector (non-negative components adding up to 1)
on the vertices of a connected graph which is sufficiently large that it cannot be stored in
a computer. Set $p_{ij}$ (the transition probability from $i$ to $j$) to $p_j$ for all $i \ne j$ which are
adjacent in the graph. Show that the stationary probability vector is $p$. Is a random walk
an efficient way to sample according to a probability distribution that is close to $p$? Think,
for example, of the graph $G$ being the $n$-dimensional hypercube with $2^n$ vertices, and $p$ as
the uniform distribution over those vertices.


Exercise 4.12 Construct the edge probability for a three state Markov chain where each
pair of states is connected by an undirected edge so that the stationary probability is
$\left(\frac{1}{2}, \frac{1}{3}, \frac{1}{6}\right)$. Repeat adding a self loop with probability $\frac{1}{2}$ to the vertex with probability $\frac{1}{2}$.


Exercise 4.13 Consider a three state Markov chain with stationary probability $\left(\frac{1}{2}, \frac{1}{3}, \frac{1}{6}\right)$.
Consider the Metropolis-Hastings algorithm with $G$ the complete graph on these three
vertices. For each edge and each direction what is the expected probability that we would
actually make a move along the edge?


Exercise 4.14 Consider a distribution $p$ over $\{0,1\}^2$ with $p(00) = p(11) = \frac{1}{2}$ and $p(01) = p(10) = 0$.
Give a connected graph on $\{0,1\}^2$ that would be bad for running Metropolis-Hastings
and a graph that would be good for running Metropolis-Hastings. What would be
the problem with Gibbs sampling?

Exercise 4.15 Consider $p(x)$ where $x \in \{0,1\}^{100}$ such that $p(\mathbf{0}) = \frac{1}{2}$ and $p(x) = \frac{1/2}{2^{100} - 1}$
for $x \ne \mathbf{0}$. How does Gibbs sampling behave?


Exercise 4.16 Given a connected graph G and an integer k how would you generate
connected subgraphs of G with k vertices with probability proportional to the number of 
edges in the subgraph? A subgraph of G does not need to have all edges of G that join 
vertices of the subgraph. The probabilities need not be exactly proportional to the number 
of edges and you are not expected to prove your algorithm for this problem. 


Exercise 4.17 Suppose one wishes to generate uniformly at random a regular, degree 
three, undirected, not necessarily connected multi-graph with 1,000 vertices. A multi- 
graph may have multiple edges between a pair of vertices and self loops. One decides to 
do this by a Markov Chain Monte Carlo technique. In particular, consider a (very large) 
network where each vertex corresponds to a regular degree three, 1,000 vertex multi-graph. 
For edges, say that the vertices corresponding to two graphs are connected by an edge if 
one graph can be obtained from the other by a flip of a pair of edges. In a flip, a pair of 
edges (a,b) and (c,d) are replaced by (a,c) and (b, d). 


1. Prove that the network whose vertices correspond to the desired graphs is connected. 
That is, for any two 1000-vertex degree-3 multigraphs, it is possible to walk from 
one to the other in this network. 


2. Prove that the stationary probability of the random walk is uniform over all vertices. 
3. Give an upper bound on the diameter of the network. 


4. How would you modify the process if you wanted to uniformly generate connected 
degree three multi-graphs? 


In order to use a random walk to generate the graphs in a reasonable amount of time, the 
random walk must rapidly converge to the stationary probability. Proving this is beyond 
the material in this book. 


Exercise 4.18 Construct, program, and execute an algorithm to estimate the volume of 
a unit radius sphere in 20 dimensions by carrying out a random walk on a 20 dimensional 
grid with 0.1 spacing. 


Exercise 4.19 For an undirected graph $G$ with edge weights $w_{x,y} = w_{y,x}$, set $p_{x,y} = \frac{w_{x,y}}{\sum_y w_{x,y}}$.


1. Given an undirected graph with edge probabilities, can you always select edge weights 
that will give rise to the desired probabilities? 


2. Lemma 4.1 states that if for all $x$ and $y$, $\pi_x p_{xy} = \pi_y p_{yx}$, then $\pi$ is the stationary
probability. However is the converse true? If $\pi$ is the stationary probability must
$\pi_x p_{xy} = \pi_y p_{yx}$ for all $x$ and $y$?


3. Give a necessary and sufficient condition that $\pi$ is the stationary probability.


4. In an undirected graph where for some $x$ and $y$ $\pi_x p_{xy} \ne \pi_y p_{yx}$ there is a flow through
the edges. How can the probability be stationary?


5. In an undirected graph where $\pi_x p_{xy} = \pi_y p_{yx}$ prove that you can always assign weights
to edges so that $p_{x,y} = \frac{w_{x,y}}{\sum_y w_{x,y}}$.





Exercise 4.20 In the graph below what is the probability flow in each edge? 





Exercise 4.21 In computing the $\varepsilon$-mixing time, we defined the normalized conductance
of a subset of vertices $S$ as

$$\Phi(S) = \sum_{x \in S} \frac{\pi_x}{\pi(S)} \sum_{y \in \bar{S}} p_{xy}.$$

$\Phi(S)$ is the probability of leaving $S$ if we are at vertices in $S$ according to the marginalized
stationary probability. Does the following formula give the expected time to leave $S$?

$$1\,\Phi(S) + 2(1 - \Phi(S))\Phi(S) + 3(1 - \Phi(S))^2\Phi(S) + \cdots = \frac{1}{\Phi(S)}$$


Prove your answer. 


Exercise 4.22 What is the mixing time for the undirected graphs 


1. Two cliques connected by a single edge? To simplify the problem assume vertices 
have self loops except for the two vertices with the edge connecting the two cliques. 
Thus all vertices have the same degree. 


2. A graph consisting of an n vertex clique plus one additional vertex connected to one 
vertex in the clique. To simplify the problem add self loops to all vertices in the 
clique except for the vertex connected to the additional vertex. 


Exercise 4.23 What is the mixing time for


1. $G(n, p)$ with $p = \frac{\log n}{n}$?


2. A circle with $n$ vertices where at each vertex an edge has been added to another
vertex chosen at random. On average each vertex will have degree four, two circle
edges, and an edge from that vertex to a vertex chosen at random, and possibly some
edges that are the ends of the random edges from other vertices.


Exercise 4.24 Find the $\varepsilon$-mixing time for a 2-dimensional lattice with $n$ vertices in each
coordinate direction with a uniform probability distribution. To do this solve the following
problems.


1. The minimum number of edges leaving a set $S$ of size greater than or equal to $n^2/4$
is $n$.


2. The minimum number of edges leaving a set $S$ of size less than or equal to $n^2/4$ is
$\lfloor \sqrt{|S|} \rfloor$.


3. Compute $\Phi(S)$.


4. Compute $\Phi$.


5. Compute the $\varepsilon$-mixing time.


Exercise 4.25 Find the $\varepsilon$-mixing time for a $d$-dimensional lattice with $n$ vertices in each
coordinate direction with a uniform probability distribution. To do this, solve the following
problems.


1. Select a direction say $x_1$ and push all elements of $S$ in each column perpendicular
to $x_1 = 0$ as close to $x_1 = 0$ as possible. Prove that the number of edges leaving $S$
is at least as large as the number leaving the modified version of $S$.


2. Repeat step one for each direction. Argue that for a direction say $x_1$, as $x_1$ gets
larger a set in the perpendicular plane is contained in the previous set.


3. Optimize the arrangements of elements in the plane $x_1 = 0$ and move elements from
the farthest out plane in to make all planes the same shape as $x_1 = 0$ except for some
leftover elements of $S$ in the last plane. Argue that this does not increase the number
of edges out.


4. What configurations might we end up with?


5. Argue that for a given size, $S$ has at least as many edges as the modified version of
$S$.


6. What is $\Phi(S)$ for a modified form $S$?


7. What is $\Phi$ for a $d$-dimensional lattice?


8. What is the $\varepsilon$-mixing time?


Exercise 4.26


1. What is the set of possible harmonic functions on a connected graph if there are only
interior vertices and no boundary vertices that supply the boundary condition?


2. Let $q_x$ be the stationary probability of vertex $x$ in a random walk on an undirected
graph where all edges at a vertex are equally likely and let $d_x$ be the degree of vertex
$x$. Show that $\frac{q_x}{d_x}$ is a harmonic function.


3. If there are multiple harmonic functions when there are no boundary conditions, why
is the stationary probability of a random walk on an undirected graph unique?


4. What is the stationary probability of a random walk on an undirected graph?


Figure 4.14: An electrical network of resistors.


Exercise 4.27 In Section 4.5, given an electrical network, we define an associated Markov
chain such that voltages and currents in the electrical network corresponded to properties
of the Markov chain. Can we go in the reverse order and for any Markov chain construct
the equivalent electrical network?


Exercise 4.28 What is the probability of reaching vertex 1 before vertex 5 when starting
a random walk at vertex 4 in each of the following graphs?


1. 








Exercise 4.29 Consider the electrical resistive network in Figure 4.14 consisting of vertices
connected by resistors. Kirchhoff’s law states that the currents at each vertex sum to
zero. Ohm’s law states that the voltage across a resistor equals the product of the resistance
times the current through it. Using these laws calculate the effective resistance of
the network.


Exercise 4.30 Consider the electrical network of Figure 4.15.


Figure 4.15: An electrical network of resistors.


1. Set the voltage at a to one and at b to zero. What are the voltages at c and d?


2. What is the current in the edges a to c, a to d, c to d, c to b and d to b?


3. What is the effective resistance between a and b?


4. Convert the electrical network to a graph. What are the edge probabilities at each
vertex so that the probability of a walk starting at c (d) reaches a before b equals the
voltage at c (the voltage at d)?


5. What is the probability of a walk starting at c reaching a before b? a walk starting 
at d reaching a before b? 


6. What is the net frequency that a walk from a to b goes through the edge from c to 
d? 


7. What is the probability that a random walk starting at a will return to a before 
reaching b? 


Exercise 4.31 Consider a graph corresponding to an electrical network with vertices a
and b. Prove directly that $\frac{c_{eff}}{c_a}$ must be less than or equal to one. We know that this is the
escape probability and must be at most 1. But, for this exercise, do not use that fact.





Exercise 4.32 (Thomson’s Principle) The energy dissipated by the resistance of edge $xy$
in an electrical network is given by $i_{xy}^2 r_{xy}$. The total energy dissipation in the network
is $E = \frac{1}{2}\sum_{x,y} i_{xy}^2 r_{xy}$ where the $\frac{1}{2}$ accounts for the fact that the dissipation in each edge is
counted twice in the summation. Show that the actual current distribution is the distribution
satisfying Ohm’s law that minimizes energy dissipation.


Exercise 4.33 (Rayleigh’s law) Prove that reducing the value of a resistor in a network
cannot increase the effective resistance. Prove that increasing the value of a resistor cannot
decrease the effective resistance. You may use Thomson’s principle (Exercise 4.32).


Figure 4.17: Three graphs


Exercise 4.34 What is the hitting time $h_{uv}$ for two adjacent vertices on a cycle of length
$n$? What is the hitting time if the edge $(u, v)$ is removed?


Exercise 4.35 What is the hitting time $h_{uv}$ for the three graphs in Figure 4.16?


Exercise 4.36 Show that adding an edge can either increase or decrease hitting time by 
calculating ha, for the three graphs in Figure 4.17. 


Exercise 4.37 Consider the $n$ vertex connected graph shown in Figure 4.18 consisting
of an edge $(u, v)$ plus a connected graph on $n - 1$ vertices and $m$ edges. Prove that
$h_{uv} = 2m + 1$ where $m$ is the number of edges in the $n - 1$ vertex subgraph.


Exercise 4.38 Consider a random walk on a clique of size n. What is the expected 
number of steps before a given vertex is reached? 


Exercise 4.39 What is the most general solution to the difference equation
$t(i + 2) - 5t(i + 1) + 6t(i) = 0$? How many boundary conditions do you need to make the solution
unique?


Figure 4.18: A connected graph consisting of $n - 1$ vertices and $m$ edges along with a
single edge $(u, v)$.


Exercise 4.40 Given the difference equation $a_k t(i + k) + a_{k-1} t(i + k - 1) + \cdots + a_1 t(i + 1) + a_0 t(i) = 0$,
the polynomial $a_k t^k + a_{k-1} t^{k-1} + \cdots + a_1 t + a_0 = 0$ is called the characteristic
polynomial.





1. If the equation has a set of r distinct roots, what is the most general form of the 
solution? 


2. If the roots of the characteristic polynomial are not distinct what is the most general 
form of the solution? 


3. What is the dimension of the solution space? 


4. If the difference equation is not homogeneous (i.e., the right hand side is not 0) and 
f(i) is a specific solution to the non-homogeneous difference equation, what is the 
full set of solutions to the non-homogeneous difference equation? 


Exercise 4.41 Show that adding an edge to a graph can either increase or decrease com- 
mute time. 


Exercise 4.42 Consider the set of integers {1,2,...,n}. 
1. What is the expected number of draws with replacement until the integer 1 is drawn. 


2. What is the expected number of draws with replacement so that every integer is 
drawn? 


Exercise 4.43 For each of the three graphs below what is the return time starting at
vertex A? Express your answer as a function of the number of vertices, $n$, and then
express it as a function of the number of edges $m$.

[Figure: three graphs labeled a, b, and c, each with marked vertices A and B.]


Exercise 4.44 Suppose that the clique in Exercise 4.43 was replaced by an arbitrary graph
with $m - 1$ edges. What would be the return time to A in terms of $m$, the total number of
edges?


Exercise 4.45 Suppose that the clique in Exercise 4.43 was replaced by an arbitrary graph
with $m - d$ edges and there were $d$ edges from A to the graph. What would be the expected
length of a random path starting at A and ending at A after returning to A exactly $d$
times?


Exercise 4.46 Given an undirected graph with a component consisting of a single edge 
find two eigenvalues of the Laplacian L = D—A where D is a diagonal matrix with vertex 
degrees on the diagonal and A is the adjacency matrix of the graph. 


Exercise 4.47 A researcher was interested in determining the importance of various
edges in an undirected graph. He computed the stationary probability for a random walk
on the graph and let $p_i$ be the probability of being at vertex $i$. If vertex $i$ was of degree
$d_i$, the frequency that edge $(i, j)$ was traversed from $i$ to $j$ would be $\frac{1}{d_i} p_i$ and the frequency
that the edge was traversed in the opposite direction would be $\frac{1}{d_j} p_j$. Thus, he assigned an
importance of $\left| \frac{1}{d_i} p_i - \frac{1}{d_j} p_j \right|$ to the edge. What is wrong with his idea?


Exercise 4.48 Prove that two independent random walks starting at the origin on a two 
dimensional lattice will eventually meet with probability one. 


Exercise 4.49 Suppose two individuals are flipping balanced coins and each is keeping
track of the number of heads minus the number of tails. At some time will both individuals’
counts be the same?


Exercise 4.50 Consider the lattice in 2-dimensions. In each square add the two diagonal 
edges. What is the escape probability for the resulting graph? 


Exercise 4.51 Determine by simulation the escape probability for the 3-dimensional lat- 
tice. 


Exercise 4.52 What is the escape probability for a random walk starting at the root of 
an infinite binary tree? 


Exercise 4.53 Consider a random walk on the positive half line, that is the integers 
0,1,2,.... At the origin, always move right one step. At all other integers move right 
with probability 2/3 and left with probability 1/3. What is the escape probability? 


Exercise 4.54 Consider the graphs in Figure 4.19. Calculate the stationary distribution 
for a random walk on each graph and the flow through each edge. What condition holds 
on the flow through edges in the undirected graph? In the directed graph?


Figure 4.19: An undirected and a directed graph.








Exercise 4.55 Create a random directed graph with 200 vertices and roughly eight edges
per vertex. Add $k$ new vertices and calculate the pagerank with and without directed edges
from the $k$ added vertices to vertex 1. How much does adding the $k$ edges change the
pagerank of vertices for various values of $k$ and restart frequency? How much does adding
a loop at vertex 1 change the pagerank? To do the experiment carefully one needs to
consider the pagerank of a vertex to which the star is attached. If it has low pagerank its
pagerank is likely to increase a lot.


Exercise 4.56 Repeat the experiment in Exercise 4.55 for hitting time. 


Exercise 4.57 Search engines ignore self loops in calculating pagerank. Thus, to increase
pagerank one needs to resort to loops of length two. By how much can you increase the
pagerank of a page by adding a number of loops of length two?


Exercise 4.58 Number the vertices of a graph {1,2,...,n}. Define hitting time to be the
expected time from vertex 1. In (2) assume that the vertices in the cycle are sequentially
numbered.


1. What is the hitting time for a vertex in a complete directed graph with self loops? 
2. What is the hitting time for a vertex in a directed cycle with n vertices? 


Create exercise relating strongly connected and full rank
Full rank implies strongly connected.
Strongly connected does not necessarily imply full rank

$$\begin{pmatrix} 0 & 0 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$$

Is graph aperiodic iff $\lambda_1 > \lambda_2$?


Exercise 4.59 Using a web browser bring up a web page and look at the source HTML.
How would you extract the URLs of all hyperlinks on the page if you were doing a crawl
of the web? With Internet Explorer click on “source” under “view” to access the HTML
representation of the web page. With Firefox click on “page source” under “view”.


Exercise 4.60 Sketch an algorithm to crawl the World Wide Web. There is a time delay 
between the time you seek a page and the time you get it. Thus, you cannot wait until the 
page arrives before starting another fetch. There are conventions that must be obeyed if 
one were to actually do a search. Sites specify information as to how long or which files 
can be searched. Do not attempt an actual search without guidance from a knowledgeable 
person. 


5 Machine Learning


5.1 Introduction 


Machine learning algorithms are general purpose tools for generalizing from data.
They are able to solve problems from many disciplines without detailed domain-specific
knowledge. To date they have been highly successful for a wide range of tasks including
computer vision, speech recognition, document classification, automated driving,
computational science, and decision support.


The core problem. 

The core problem underlying many machine learning applications is learning a good 
classification rule from labeled data. This problem consists of a domain of interest, called 
the instance space, such as email messages or patient records, and a classification task, such 
as classifying email messages into spam versus non-spam or determining which patients 
will respond well to a given medical treatment. We will typically assume our instance
space is $\{0, 1\}^d$ or $\mathbb{R}^d$, corresponding to data that is described by d Boolean or real-valued
features. Features for email messages could be the presence or absence of various types 
of words, and features for patient records could be the results of various medical tests. 
Our learning algorithm is given a set of labeled training examples, which are points in 
our instance space along with their correct classification. This training data could be a 
collection of email messages, each labeled as spam or not spam, or a collection of patients, 
each labeled by whether or not they responded well to a given medical treatment. Our 
algorithm aims to use the training examples to produce a classification rule that will per- 
form well on new data. A key feature of machine learning, which distinguishes it from 
other algorithmic tasks, is that our goal is generalization: to use one set of data in order 
to perform well on new data not seen yet. We focus on binary classification where items 
in the domain of interest are classified into two categories (called the positive class and 
the negative class), as in the medical and spam-detection examples above, but nearly all 
the techniques described here also apply to multi-way classification. 














How to learn. 

A high-level approach is to find a “simple” rule with good performance on the training 
data. In the case of classifying email messages, we might find a set of highly indicative 
words such that every spam email in the training data has at least one of these words and 
none of the non-spam emails has any of them. In this case, the rule “if the message has 
any of these words, then it is spam else it is not” would be a simple rule that performs 
well on the training data. Alternatively, we might weight words with positive and negative weights
such that the total weighted sum of words in the email message is positive on the spam 
emails in the training data, and negative on the non-spam emails. This would produce 
a classification rule called a linear separator. We will then argue that so long as the 
training data is representative of what future data will look like, we can be confident that 
any sufficiently “simple” rule that performs well on the training data will also perform 
well on future data. To make this into a formal mathematical statement, we need to
be precise about the meaning of “simple” as well as what it means for training data to 
be “representative” of future data. We will see several notions of complexity, including 
bit-counting and VC-dimension, that will allow us to make mathematical statements of 
this form. 


5.2 The Perceptron Algorithm 


A simple rule in d-dimensional space is a linear separator or half-space: does a
weighted sum of feature values exceed a threshold? Such a rule may be thought of as
being implemented by a threshold gate that takes the feature values as inputs, computes 
their weighted sum and outputs yes or no depending on whether or not the sum is greater 
than the threshold. One could also look at a network of interconnected threshold gates 
called a neural net. Threshold gates are sometimes called perceptrons since one model of 
human perception is that it is done by a neural net in the brain. 


The problem of fitting a half-space or a linear separator consists of n labeled examples
$x_1, x_2, \ldots, x_n$ in d-dimensional space. Each example has label +1 or −1. The task is to
find a d-dimensional vector w, if one exists, and a threshold t such that

$$\begin{aligned} w \cdot x_i &> t \quad \text{for each } x_i \text{ labelled } +1 \\ w \cdot x_i &< t \quad \text{for each } x_i \text{ labelled } -1. \end{aligned} \tag{5.1}$$


A vector-threshold pair, (w,t), satisfying the inequalities is called a “linear separator”. 


The above formulation is a linear program in the unknowns w and t that can be 
solved by a general purpose linear programming algorithm. A simpler algorithm called 
the Perceptron Algorithm can be much faster when there is a feasible solution w with a 
lot of “wiggle room” or margin. 


We begin with a technical modification, adding an extra coordinate to each $x_i$ and
$w$, writing $\hat{x}_i = (x_i, 1)$ and $\hat{w} = (w, -t)$. Suppose $l_i$ is the $\pm 1$ label on $x_i$. Then, the
inequalities in (5.1) can be rewritten as

$$(\hat{w} \cdot \hat{x}_i)\, l_i > 0, \qquad 1 \le i \le n.$$

Adding the extra coordinate increased the dimension by one, but now the separator
contains the origin. In particular, for examples of label +1, $\hat{w} \cdot \hat{x}_i > 0$ and for examples of
label −1, $\hat{w} \cdot \hat{x}_i < 0$. For simplicity of notation, in the rest of this section, we drop the
hats, and let $x_i$ and $w$ stand for the corresponding $\hat{x}_i$ and $\hat{w}$.


The Perceptron Algorithm is very simple and follows below. It is important to note
that the weight vector w will be a linear combination of the $x_i$.

The Perceptron Algorithm

$w \leftarrow 0$
while there exists $x_i$ with $x_i l_i \cdot w \le 0$, update $w \leftarrow w + x_i l_i$
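
The update rule above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `perceptron`, the toy data, and the `max_updates` safety cap are assumptions made here, not part of the text. Each example carries the extra coordinate 1 so that the threshold is absorbed into w, and an update fires when $(w \cdot x_i) l_i \le 0$, so the first update happens from $w = 0$.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def perceptron(examples, labels, max_updates=10_000):
    """Return (w, number_of_updates) for examples already augmented with a trailing 1."""
    w = [0.0] * len(examples[0])
    updates = 0
    while updates < max_updates:
        # Look for an example with (w . x_i) l_i <= 0.
        mistake = next(((x, l) for x, l in zip(examples, labels)
                        if l * dot(w, x) <= 0), None)
        if mistake is None:
            return w, updates
        x, l = mistake
        w = [wi + l * xi for wi, xi in zip(w, x)]   # w <- w + x_i l_i
        updates += 1
    raise RuntimeError("no separator found within max_updates")

# Hypothetical toy data: label +1 when x1 + x2 is large, -1 when it is small.
raw_points = [(0, 0), (1, 1), (2, 0), (3, 3), (2, 4), (4, 1)]
labels = [-1, -1, -1, 1, 1, 1]
examples = [(x1, x2, 1) for x1, x2 in raw_points]

w, updates = perceptron(examples, labels)
print("w =", w, "after", updates, "updates")
```

The safety cap exists only because the loop never terminates on data that is not linearly separable; on separable data the loop exits on its own.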


The intuition is that correcting $w$ by adding $x_i l_i$ causes the new $(w \cdot x_i) l_i$ to be higher
by $x_i \cdot x_i l_i^2 = |x_i|^2$. This is good for this $x_i$, but the change may be bad for the other $x_j$.
The proof below shows that this very simple process yields a solution $w$ with a number
of steps that depends on the margin of separation of the data. If a weight vector $w^*$
satisfies $(w^* \cdot x_i) l_i > 0$ for $1 \le i \le n$, the minimum distance of any example $x_i$ to the linear
separator $w^* \cdot x = 0$ is called the margin of the separator. Scale $w^*$ so that $(w^* \cdot x_i) l_i \ge 1$
for all $i$. Then the margin of the separator is at least $1/|w^*|$. If all points lie inside a
ball of radius $r$, then $r|w^*|$ is the ratio of the radius of the ball to the margin. Theorem
5.1 below shows that the number of update steps of the Perceptron Algorithm is at most
the square of this quantity. Thus, the number of update steps will be small when data
is separated by a large margin relative to the radius of the smallest enclosing ball of the
data.


Theorem 5.1 If there is a $w^*$ satisfying $(w^* \cdot x_i) l_i \ge 1$ for all $i$, then the Perceptron
Algorithm finds a solution $w$ with $(w \cdot x_i) l_i > 0$ for all $i$ in at most $r^2 |w^*|^2$ updates where
$r = \max_i |x_i|$.


Proof: Let $w^*$ satisfy the “if” condition of the theorem. We will keep track of two
quantities, $w^T w^*$ and $|w|^2$. Each update increases $w^T w^*$ by at least one.

$$(w + x_i l_i)^T w^* = w^T w^* + x_i^T l_i w^* \ge w^T w^* + 1$$

On each update, $|w|^2$ increases by at most $r^2$.

$$(w + x_i l_i)^T (w + x_i l_i) = |w|^2 + 2 x_i^T l_i w + |x_i l_i|^2 \le |w|^2 + |x_i|^2 \le |w|^2 + r^2,$$

where the middle inequality comes from the fact that an update is only performed on an
$x_i$ when $x_i^T l_i w \le 0$.

If the Perceptron Algorithm makes m updates, then $w^T w^* \ge m$ and $|w|^2 \le m r^2$. By the
Cauchy-Schwarz inequality, $|w||w^*| \ge w^T w^* \ge m$, and also $|w| \le r\sqrt{m}$. Then

$$\begin{aligned} m &\le |w||w^*| \\ m/|w^*| &\le |w| \\ m/|w^*| &\le r\sqrt{m} \\ \sqrt{m} &\le r|w^*| \\ m &\le r^2 |w^*|^2 \end{aligned}$$

as desired. ∎


5.3 Kernel Functions and Non-linearly Separable Data


If the data is linearly separable, then the Perceptron Algorithm finds a solution. If the 
data is not linearly separable, then perhaps one can map the data to a higher dimensional 
space where it is linearly separable. Consider 2-dimensional data consisting of two circular
rings of radius one and two. Clearly the data is not linearly separable. The mapping

$$(x, y) \to (x, y, x^2 + y^2)$$

would move the ring of radius two farther from the plane than the ring of radius one,
making the mapped data linearly separable.
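
As a small illustration of this lifting, the sketch below samples points on the two rings, applies the mapping, and checks that a horizontal plane separates the images. The helper names `lift` and `ring`, the number of sample points, and the choice of the plane $z = 2.5$ are assumptions made for the example.

```python
import math

def lift(point):
    """Map (x, y) to (x, y, x^2 + y^2)."""
    x, y = point
    return (x, y, x * x + y * y)

def ring(radius, count=12):
    """Sample `count` evenly spaced points on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * k / count),
             radius * math.sin(2 * math.pi * k / count)) for k in range(count)]

inner = [lift(p) for p in ring(1.0)]
outer = [lift(p) for p in ring(2.0)]

# After lifting, the third coordinate is the squared radius: about 1 on the
# inner ring and about 4 on the outer ring, so the plane z = 2.5 separates them.
threshold = 2.5
inner_below = all(z < threshold for _, _, z in inner)
outer_above = all(z > threshold for _, _, z in outer)
print(inner_below, outer_above)
```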


If a function $\varphi$ maps the data to another space, one can run the Perceptron Algorithm
in the new space. The weight vector will be a linear function $\sum_{i=1}^n c_i \varphi(x_i)$ of the new
input data. To determine if a pattern $x_j$ is correctly classified compute

$$w^T \varphi(x_j) = \sum_{i=1}^n c_i \varphi(x_i)^T \varphi(x_j).$$

Since only products of the $\varphi(x_i)$ appear, we do not need to explicitly compute, or even
to know, the mapping $\varphi$ if we have a function

$$k(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$$

called a kernel function. To add $\varphi(x_j)$ to the weight vector, instead of computing the
mapping $\varphi$, just add one to the coefficient $c_j$. Given a collection of examples $x_1, x_2, \ldots, x_n$
and a kernel function k, the associated kernel matrix K is defined as $k_{ij} = \varphi(x_i)^T \varphi(x_j)$.
A natural question here is for a given matrix K, how can one tell if there exists a function
$\varphi$ such that $k_{ij} = \varphi(x_i)^T \varphi(x_j)$. The following lemma resolves this issue.
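
A kernelized version of the Perceptron Algorithm along these lines might look as follows. This is a sketch under stated assumptions: the function names, the XOR-style toy data, and the choice of kernel $(1 + x \cdot y)^2$ are illustrative, and the coefficient of a misclassified example is changed by its label $l_j$ (that is, by $+1$ or $-1$) so that the label is folded into the coefficient. The mapping itself is never computed; only kernel evaluations are used.

```python
def poly_kernel(x, y):
    """k(x, y) = (1 + x . y)^2."""
    return (1 + sum(a * b for a, b in zip(x, y))) ** 2

def kernel_perceptron(points, labels, kernel, max_passes=100):
    """Return signed coefficients c, representing w = sum_i c[i] * phi(points[i])."""
    n = len(points)
    c = [0] * n
    for _ in range(max_passes):
        changed = False
        for j in range(n):
            # w . phi(x_j), computed with kernel evaluations only.
            score = sum(c[i] * kernel(points[i], points[j]) for i in range(n))
            if labels[j] * score <= 0:
                c[j] += labels[j]      # implicitly adds l_j * phi(x_j) to w
                changed = True
        if not changed:
            return c
    raise RuntimeError("did not converge within max_passes")

def predict(c, points, kernel, x):
    """Classify x by the sign of w . phi(x)."""
    score = sum(ci * kernel(p, x) for ci, p in zip(c, points))
    return 1 if score > 0 else -1

# XOR-like data: not linearly separable in the plane.
points = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
labels = [1, 1, -1, -1]
c = kernel_perceptron(points, labels, poly_kernel)
predictions = [predict(c, points, poly_kernel, p) for p in points]
print(predictions)
```

No separate threshold is needed here because this particular kernel already contains a constant feature.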


Lemma 5.2 A matrix K is a kernel matrix, i.e., there is a function $\varphi$ such that
$k_{ij} = \varphi(x_i)^T \varphi(x_j)$, if and only if K is positive semidefinite.

Proof: If K is positive semidefinite, then it can be expressed as $K = BB^T$. Define $\varphi(x_i)^T$
to be the $i$th row of B. Then $k_{ij} = \varphi(x_i)^T \varphi(x_j)$. Conversely, if there is an embedding $\varphi$
such that $k_{ij} = \varphi(x_i)^T \varphi(x_j)$, then using the $\varphi(x_i)^T$ for the rows of a matrix B, $K = BB^T$
and so K is positive semidefinite. ∎
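
Both directions of the proof rest on standard linear algebra facts, spelled out here for completeness. The symbols $z$, $V$, and $\Lambda$ are notation introduced only for this remark.

```latex
% Any matrix of the form B B^T is positive semidefinite: for every vector z,
z^T K z \;=\; z^T B B^T z \;=\; (B^T z)^T (B^T z) \;=\; |B^T z|^2 \;\ge\; 0 .

% Conversely, a symmetric positive semidefinite K has an eigendecomposition
% K = V \Lambda V^T with all eigenvalues nonnegative, so one valid choice is
B \;=\; V \Lambda^{1/2}, \qquad B B^T \;=\; V \Lambda^{1/2} \Lambda^{1/2} V^T \;=\; K .
```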


Many different pairwise functions are legal kernel functions. One easy way to create a 
kernel function is by combining other kernel functions together, via the following theorem. 


Theorem 5.3 Suppose $k_1$ and $k_2$ are kernel functions. Then


1. For any constant $c \ge 0$, $c k_1$ is a legal kernel. In fact, for any scalar function f, the
function $k_3(x, y) = f(x) f(y) k_1(x, y)$ is a legal kernel.


2. The sum $k_1 + k_2$ is a legal kernel.


3. The product, $k_1 k_2$, is a legal kernel.


Figure 5.1: Data that is not linearly separable in the input space $\mathbb{R}^2$ but that is linearly
separable in the “$\varphi$-space,” $\varphi(x) = (1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, \sqrt{2}x_1 x_2, x_2^2)$, corresponding to the
kernel function $k(x, y) = (1 + x_1 y_1 + x_2 y_2)^2$.


The proof of Theorem 5.3 is relegated to Exercise 5.6. Notice that Theorem 5.3 immediately
implies that the function $k(x, y) = (1 + x^T y)^k$ is a legal kernel by using the fact
that $k_1(x, y) = 1$ is a legal kernel, $k_2(x, y) = x^T y$ is a legal kernel, then adding them,
and multiplying the sum by itself k times.
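
For the degree-two case in two dimensions, the correspondence between the kernel and an explicit feature map can be checked numerically. The sketch below assumes the six-coordinate map given in the caption of Figure 5.1; the test points are arbitrary.

```python
import math

def phi(x):
    """Explicit feature map for k(x, y) = (1 + x1*y1 + x2*y2)^2 in two dimensions."""
    x1, x2 = x
    s = math.sqrt(2)
    return (1.0, s * x1, s * x2, x1 * x1, s * x1 * x2, x2 * x2)

def kernel(x, y):
    """Degree-two polynomial kernel, evaluated without the feature map."""
    return (1 + x[0] * y[0] + x[1] * y[1]) ** 2

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

pairs = [((1.0, 2.0), (3.0, -1.0)),
         ((0.5, -0.5), (2.0, 1.0)),
         ((0.0, 0.0), (1.0, 1.0))]
agree = all(math.isclose(kernel(x, y), dot(phi(x), phi(y)), abs_tol=1e-9)
            for x, y in pairs)
print(agree)
```

The kernel needs one inner product in the original two dimensions, while the explicit map works in six; the gap grows quickly with the degree and the input dimension.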


Another popular kernel is the Gaussian kernel, defined as:

$$k(x, y) = e^{-c|x - y|^2}.$$

If we think of a kernel as a measure of similarity, then this kernel defines the similarity
between two data objects as a quantity that decreases exponentially with the squared
distance between them. The Gaussian kernel can be shown to be a true kernel function
by first writing it as

$$f(x) f(y)\, e^{2c\, x^T y}$$

for $f(x) = e^{-c|x|^2}$ and then taking the Taylor expansion of $e^{2c\, x^T y}$ and applying the rules
in Theorem 5.3. Technically, this last step requires considering countably infinitely many
applications of the rules and allowing for infinite-dimensional vector spaces.
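
Written out, and assuming the Gaussian kernel has the form $e^{-c|x-y|^2}$ for a constant $c > 0$ (the constant $c$ is notation assumed here), the factorization and expansion are:

```latex
\begin{aligned}
e^{-c|x-y|^2}
  &= e^{-c\left(|x|^2 - 2x^Ty + |y|^2\right)}
   = e^{-c|x|^2}\, e^{-c|y|^2}\, e^{2c\,x^Ty}
   = f(x)\, f(y)\, e^{2c\,x^Ty},
   \qquad f(x) = e^{-c|x|^2}, \\
e^{2c\,x^Ty}
  &= \sum_{j=0}^{\infty} \frac{(2c)^j}{j!}\,\left(x^Ty\right)^j .
\end{aligned}
```

Each power $(x^Ty)^j$ is a kernel by the product rule, each nonnegative multiple of it is a kernel by rule 1, the partial sums are kernels by the sum rule, and multiplying by $f(x) f(y)$ preserves the kernel property by rule 1 again.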


5.4 Generalizing to New Data 


So far, we have focused on finding an algorithm that performs well on a given set S 
of training data. But what we really want is an algorithm that performs well on new 
data not seen yet. To make guarantees of this form, we need some assumption that our 
training data is representative of what new data will look like. Formally, we assume that 
the training data and the new data are drawn from the same probability distribution. 
Additionally, we want our algorithm's classification rule to be simple. These two condi- 
tions allow us to guarantee the ability of our trained algorithm to perform well on new 
unseen data.


Formalizing the problem. To formalize the learning problem, assume there is some 
probability distribution D over the instance space X, such that 


1. our training set S consists of points drawn independently at random from D, and 
2. our objective is to predict well on new points that are also drawn from D. 


This is the sense in which we assume that our training data is representative of future 
data. Let c*, called the target concept, denote the subset of X corresponding to the posi- 
tive class for a desired binary classification. For example, c* would correspond to the set 
of all patients who would respond well to a given treatment in a medical scenario, or it 
could correspond to the set of all possible spam emails in a spam-detection scenario. Each 
point in our training set S is labeled according to whether or not it belongs to c*. Our 
goal is to produce a set $h \subseteq X$, called our hypothesis, that is close to $c^*$ with respect to the
distribution D. The true error of h is $\mathrm{err}_D(h) = \mathrm{Prob}(h \triangle c^*)$ where $\triangle$ denotes symmetric
difference, and the probability mass is according to D. In other words, the true error of
h is the probability it incorrectly classifies a data point drawn at random from D. Our
goal is to produce an h of low true error. The training error of h, denoted $\mathrm{err}_S(h)$, is the
fraction of points in S on which h and $c^*$ disagree. That is, $\mathrm{err}_S(h) = |S \cap (h \triangle c^*)|/|S|$.
Note that even though S is assumed to consist of points randomly drawn from D, it is 
possible for an hypothesis h to have low training error or even to completely agree with 
c* over the training sample, and yet have high true error. This is called overfitting the 
training data. For instance, a hypothesis h that simply consists of listing the positive 
examples in S, which is equivalent to an algorithm that memorizes the training sample 
and predicts positive on an example if and only if it already appeared positively in the 
training sample, would have zero training error. However, this hypothesis likely would 
have high true error and therefore would be highly overfitting the training data. More 
generally, overfitting is a concern because algorithms will typically be optimizing over the 
training sample. To design and analyze algorithms for learning, we need to address the 
issue of overfitting. 
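
The memorizing hypothesis can be simulated directly. The sketch below makes several illustrative assumptions: the instance space is the integers 0 to 999, D is uniform, the target concept is the set of even numbers, and the sample has 50 points. Because D is uniform over a finite set, the true error can be computed exactly.

```python
import random

random.seed(0)
X = range(1000)                          # instance space; D is uniform over X
target = {x for x in X if x % 2 == 0}    # c*: hypothetical target concept

S = [random.choice(X) for _ in range(50)]   # training sample drawn i.i.d. from D
h = {x for x in S if x in target}           # memorize the positive training examples

def err_S(hyp):
    """Fraction of training points on which hyp and the target disagree."""
    return sum((x in hyp) != (x in target) for x in S) / len(S)

def err_D(hyp):
    """Exact true error, since D is uniform over the finite set X."""
    return sum((x in hyp) != (x in target) for x in X) / len(X)

print("training error:", err_S(h))
print("true error:", err_D(h))
```

The memorizer predicts negative on every even number it has not seen, so with 50 samples it must be wrong on at least 450 of the 1000 points, even though it is never wrong on S.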


To analyze overfitting, we introduce the notion of an hypothesis class, also called a
concept class or set system. An hypothesis class H over X is a collection of subsets of
X, called hypotheses or concepts. For instance, the class of intervals over $X = \mathbb{R}$ is the
collection $\{[a, b] \mid a \le b\}$. The class of linear separators over $\mathbb{R}^d$ is the collection

$$\{\{x \in \mathbb{R}^d \mid w \cdot x \ge t\} \mid w \in \mathbb{R}^d, t \in \mathbb{R}\}.$$

It is the collection of all sets in $\mathbb{R}^d$ that are linearly separable from their complement. In
the case that X is the set of four points in the plane $\{(-1,-1), (-1,1), (1,-1), (1,1)\}$,
the class of linear separators contains 14 of the $2^4 = 16$ possible subsets of X.$^{18}$ Given
an hypothesis class H and training set S, we aim to find an hypothesis in H that closely
agrees with $c^*$ over S. For example, the Perceptron Algorithm will find a linear separator
that agrees with the target concept over S so long as S is linearly separable. To address
overfitting, we argue that if the training set S is large enough compared to some property
of H, then with high probability all $h \in H$ have their training error close to their true
error, so that if we find a hypothesis whose training error is low, we can be confident its
true error will be low as well.

$^{18}$ The only two subsets that are not in the class are the sets $\{(-1,-1), (1,1)\}$ and $\{(-1,1), (1,-1)\}$.

Figure 5.2: The hypothesis $h_i$ disagrees with the truth in one quarter of the emails.
Thus with a training set S, the probability that the hypothesis will survive is $(1 - 0.25)^{|S|}$.
The arrows represent three elements in S, one of which contributes to the training error.
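
The count of 14 subsets for the four-point example can be reproduced by brute force. The sketch below searches a coarse, hand-picked grid of weight vectors and thresholds (an assumption of this example, sufficient for these four points) and collects the subsets of the form $\{x : w \cdot x \ge t\}$. The grid search only shows which subsets are realizable; that the remaining two are not realizable follows from the usual XOR argument rather than from the search.

```python
from itertools import combinations, product

X = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

# Hypothetical coarse grid of separators; enough to realize every achievable subset here.
weights = list(product([-1, 0, 1], repeat=2))
thresholds = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]

realized = set()
for w, t in product(weights, thresholds):
    subset = frozenset(x for x in X if w[0] * x[0] + w[1] * x[1] >= t)
    realized.add(subset)

all_subsets = [frozenset(c) for r in range(len(X) + 1) for c in combinations(X, r)]
missing = [sorted(s) for s in all_subsets if s not in realized]
print(len(realized), "of", len(all_subsets), "subsets realized; missing:", missing)
```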


5.4.1 Overfitting and Uniform Convergence 


We now present two generalization guarantees that explain how one can guard against
overfitting. To keep things simple, assume our hypothesis class H is finite. In the Perceptron
Algorithm with d-dimensional data, if we quantize the weights to be 32 bit integers,
then there are only $2^{32d}$ possible hypotheses. Later, we will see how to extend these results
to infinite classes as well. Given a class of hypotheses H, the first result states that for
$\epsilon$ greater than zero, so long as the training data set is large compared to $\frac{1}{\epsilon}\ln(|H|)$, it
is unlikely any hypothesis h in H with true error greater than $\epsilon$ will have zero training
error. This means that with high probability, any hypothesis that our algorithm finds
that agrees with the target hypothesis on the training data will have low true error. The
second result states that if the training data set is large compared to $\frac{1}{\epsilon^2}\ln(|H|)$, then it is
unlikely that the training error and true error will differ by more than $\epsilon$ for any hypothesis
in H.


The basic idea is the following. For h with large true error and an element $x \in X$
selected at random, there is a reasonable chance that x will belong to the symmetric
difference $h \triangle c^*$. If the training sample S is large enough with each point drawn
independently from X, the chance that S is completely disjoint from $h \triangle c^*$ is incredibly small.
This is for a single hypothesis, h. When H is finite we can apply the union bound over
all $h \in H$ of large true error. We formalize this below.


Theorem 5.4 Let H be an hypothesis class and let $\epsilon$ and $\delta$ be greater than zero. If a
training set S of size

$$n \ge \frac{1}{\epsilon}\left(\ln|H| + \ln(1/\delta)\right)$$

is drawn from distribution D, then with probability greater than or equal to $1 - \delta$ every
h in H with true error $\mathrm{err}_D(h) \ge \epsilon$ has training error $\mathrm{err}_S(h) > 0$. Equivalently, with
probability greater than or equal to $1 - \delta$, every $h \in H$ with training error zero has true
error less than $\epsilon$.


Proof: Let $h_1, h_2, \ldots$ be the hypotheses in H with true error greater than or equal to $\epsilon$.
These are the hypotheses that we don’t want to output. Consider drawing the sample S
of size n and let $A_i$ be the event that $h_i$ has zero training error. Since every $h_i$ has true
error greater than or equal to $\epsilon$,

$$\mathrm{Prob}(A_i) \le (1 - \epsilon)^n.$$

By the union bound over all i, the probability that any of these $h_i$ is consistent with S is
given by

$$\mathrm{Prob}(\cup_i A_i) \le |H|(1 - \epsilon)^n.$$

Using the fact that $(1 - \epsilon) \le e^{-\epsilon}$, the probability that any hypothesis in H with true error
greater than or equal to $\epsilon$ has training error zero is at most $|H|e^{-\epsilon n}$. Replacing n by the
sample size bound from the theorem statement, this is at most $|H|e^{-\ln|H| - \ln(1/\delta)} = \delta$ as
desired. ∎
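
To get a feel for the numbers, the sample-size bound of Theorem 5.4 can be evaluated directly. The function name and the example parameters below are illustrative assumptions; the class size is passed as $\ln|H|$ so that very large classes, such as the $2^{32d}$ quantized linear separators mentioned earlier, do not overflow.

```python
import math

def pac_sample_size(log_H, epsilon, delta):
    """Smallest integer n with n >= (1/epsilon) * (ln|H| + ln(1/delta)).

    log_H is ln|H|, passed directly so that very large classes do not overflow.
    """
    return math.ceil((log_H + math.log(1 / delta)) / epsilon)

# Example: d = 10 features with 32-bit weights, so |H| = 2^(32 d).
d = 10
log_H = 32 * d * math.log(2)
n = pac_sample_size(log_H, epsilon=0.1, delta=0.05)
print(n)

# Sanity check of the proof's final step: |H| e^{-epsilon n} <= delta, in log form.
print(log_H - 0.1 * n <= math.log(0.05))
```

The bound grows only linearly in the number of bits needed to name a hypothesis, and only logarithmically in $1/\delta$.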


The conclusion of Theorem 5.4 is sometimes called a “PAC-learning guarantee” since
it states that an $h \in H$ consistent with the sample is Probably Approximately Correct.


Theorem 5.4 addressed the case where there exists a hypothesis in H with zero training
error. What if the best h in H has 5% error on S? Can we still be confident that its true
error is low, say at most 10%? For this, we want an analog of Theorem 5.4 that says for
a sufficiently large training set S, with high probability every $h \in H$ has training error
within $\pm\epsilon$ of the true error. Such a statement is called uniform convergence because we
are asking that the training set errors converge to their true errors uniformly over all sets
in H. To see why such a statement should be true for sufficiently large S and a single
hypothesis h, consider two strings that differ in 10% of the positions and randomly select
a large sample of positions. The fraction of positions that differ in the sample will be
close to 10%.


To prove uniform convergence bounds, we use a tail inequality for sums of independent 
Bernoulli random variables (i.e., coin tosses). The following is particularly convenient and 
is a variation on the Chernoff bounds in Section 12.5.1 of the appendix. 

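Theorem 5.5 (Hoeffding bounds) Let $x_1, x_2, \ldots, x_n$ be independent {0, 1}-valued random
variables with $\mathrm{Prob}(x_i = 1) = p$. Let $s = \sum_i x_i$ (equivalently, flip n coins of bias p
and let s be the total number of heads). For any $0 \le \alpha \le 1$,

$$\begin{aligned} \mathrm{Prob}(s/n > p + \alpha) &\le e^{-2n\alpha^2} \\ \mathrm{Prob}(s/n < p - \alpha) &\le e^{-2n\alpha^2}. \end{aligned}$$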
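
The upper-tail bound can be compared with the exact binomial tail for small n. The sketch below is illustrative only: the function names and the parameter triples are assumptions, and the exact tail is computed by summing binomial probabilities with `math.comb` (Python 3.8 or later).

```python
import math

def upper_tail(n, p, alpha):
    """Exact Prob(s/n > p + alpha) for s distributed as Binomial(n, p)."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1) if k / n > p + alpha)

def hoeffding_bound(n, alpha):
    """Right-hand side of the Hoeffding bound, exp(-2 n alpha^2)."""
    return math.exp(-2 * n * alpha**2)

settings = [(100, 0.5, 0.1), (100, 0.2, 0.1), (1000, 0.5, 0.05)]
for n, p, alpha in settings:
    print(n, p, alpha, upper_tail(n, p, alpha), hoeffding_bound(n, alpha))
```

In each row the exact tail should come out below the bound, typically by a wide margin, since the bound does not use the value of p.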


Theorem 5.5 implies the following uniform convergence analog of Theorem 5.4.


Theorem 5.6 (Uniform convergence) Let H be a hypothesis class and let $\epsilon$ and $\delta$ be
greater than zero. If a training set S of size

$$n \ge \frac{1}{2\epsilon^2}\left(\ln|H| + \ln(2/\delta)\right)$$

is drawn from distribution D, then with probability greater than or equal to $1 - \delta$, every h
in H satisfies $|\mathrm{err}_S(h) - \mathrm{err}_D(h)| \le \epsilon$.


Proof: Fix some $h \in H$ and let $x_j$ be the indicator random variable for the event that h
makes a mistake on the $j$th example in S. The $x_j$ are independent {0, 1} random variables
and the probability that $x_j$ equals 1 is the true error of h, and the fraction of the $x_j$'s equal
to 1 is exactly the training error of h. Therefore, Hoeffding bounds guarantee that the
probability of the event $A_h$ that $|\mathrm{err}_D(h) - \mathrm{err}_S(h)| > \epsilon$ is less than or equal to $2e^{-2n\epsilon^2}$.
Applying the union bound to the events $A_h$ over all $h \in H$, the probability that there
exists an $h \in H$ with the difference between true error and empirical error greater than $\epsilon$
is less than or equal to $2|H|e^{-2n\epsilon^2}$. Using the value of n from the theorem statement, the
right-hand-side of the above inequality is at most $\delta$ as desired. ∎


Theorem 5.6 justifies the approach of optimizing over our training sample S even if we 
are not able to find a rule of zero training error. If our training set S is sufficiently large, 
with high probability, good performance on S will translate to good performance on D. 


Note that Theorems 5.4 and 5.6 require |H| to be finite in order to be meaningful.
The notions of growth functions and VC-dimension in Section 5.5 extend Theorem 5.6 to
certain infinite hypothesis classes.


5.4.2 Occam’s Razor 


Occam's razor is the notion, stated by William of Occam around AD 1320, that in 
general one should prefer simpler explanations over more complicated ones.$^{19}$ Why should
one do this, and can we make a formal claim about why this is a good idea? What if 
each of us disagrees about precisely which explanations are simpler than others? It turns 
out we can use Theorem 5.4 to make a mathematical statement of Occam's razor that 
addresses these issues. 


What do we mean by a rule being “simple”? Assume that each of us has some way of 
describing rules, using bits. The methods, also called description languages, used by each 
of us may be different, but one fact is that in any given description language, there are at 
most $2^b$ rules that can be described using fewer than b bits (because $1 + 2 + 4 + \ldots + 2^{b-1} < 2^b$).
Therefore, by setting H to be the set of all rules that can be described in fewer than
b bits and plugging into Theorem 5.4, we have the following: 





$^{19}$ The statement more explicitly was that “Entities should not be multiplied unnecessarily.”


Theorem 5.7 (Occam’s razor) Fix any description language and consider a training
sample S drawn from distribution D. With probability at least $1 - \delta$, any rule h with
$\mathrm{err}_S(h) = 0$ that can be described using fewer than b bits will have $\mathrm{err}_D(h) \le \epsilon$ for
$|S| = \frac{1}{\epsilon}[b\ln(2) + \ln(1/\delta)]$. Equivalently, with probability at least $1 - \delta$, all rules with
$\mathrm{err}_S(h) = 0$ that can be described in fewer than b bits will have $\mathrm{err}_D(h) \le \frac{b\ln(2) + \ln(1/\delta)}{|S|}$.


Using the fact that $\ln(2) < 1$ and ignoring the low-order $\ln(1/\delta)$ term, this means that if
the number of bits it takes to write down a rule consistent with the training data is at 
most 10% of the number of data points in our sample, then we can be confident it will 
have error at most 10% with respect to D. What is perhaps surprising about this theorem 
is that it means that we can each have different ways of describing rules and yet all use 
Occam’s razor. Note that the theorem does not say that complicated rules are necessarily 
bad, or even that given two rules consistent with the data that the complicated rule is 
necessarily worse. What it does say is that Occam’s razor is a good policy in that simple 
rules are unlikely to fool us since there are just not that many simple rules. 
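
As a worked instance of the 10% rule of thumb, take a hypothetical sample of $|S| = 10{,}000$ points, a consistent rule describable in fewer than $b = 1{,}000$ bits, and $\delta = 0.01$; these numbers are chosen only for illustration.

```latex
\mathrm{err}_D(h) \;\le\; \frac{b \ln 2 + \ln(1/\delta)}{|S|}
  \;=\; \frac{1000 \cdot 0.6931 + \ln 100}{10000}
  \;\approx\; \frac{693.1 + 4.6}{10000}
  \;\approx\; 0.070 \;<\; 0.10 .
```

The $\ln(1/\delta)$ term contributes less than one part in a hundred of the numerator here, which is why it can be treated as low-order.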


5.4.3 Regularization: Penalizing Complexity 


Theorems 5.6 and 5.7 suggest the following idea. Suppose that there is no simple rule 
that is perfectly consistent with the training data, but there are very simple rules with 
training error 20%, and some more complex rules with training error 10%, and so on. In 
this case, perhaps we should optimize some combination of training error and simplicity. 
This is the notion of regularization, also called complexity penalization. 


Specifically, a regularizer is a penalty term that penalizes more complex hypotheses.
So far, a natural measure of complexity of a hypothesis is the number of bits needed
to write it down.$^{20}$ Consider fixing some description language, and let $H_i$ denote those
hypotheses that can be described in i bits in this language, so $|H_i| \le 2^i$. Let $\delta_i = \delta/2^i$.
Rearranging the bound of Theorem 5.6, with probability at least $1 - \delta_i$, all h in $H_i$ satisfy
$\mathrm{err}_D(h) \le \mathrm{err}_S(h) + \sqrt{\frac{\ln(|H_i|) + \ln(2/\delta_i)}{2|S|}}$. Applying the union bound over all i, using the fact
that $\delta = \delta_1 + \delta_2 + \delta_3 + \ldots$, and also the fact that $\ln(|H_i|) + \ln(2/\delta_i) \le i\ln(4) + \ln(2/\delta)$,
gives the following corollary.





Corollary 5.8 Fix any description language and a training sample S. With probability
greater than or equal to $1 - \delta$, all hypotheses h satisfy

$$\mathrm{err}_D(h) \le \mathrm{err}_S(h) + \sqrt{\frac{\mathrm{size}(h)\ln(4) + \ln(2/\delta)}{2|S|}}$$

where size(h) denotes the number of bits needed to describe h in the given language.





$^{20}$ Later we will see support vector machines that use a regularizer for linear separators based on the
margin of separation of data.


Corollary 5.8 tells us that rather than searching for a rule of low training error, we may
want to search for a rule with a low right-hand-side in the displayed formula. If we can 
find an hypothesis for which this quantity is small, we can be confident true error will be 
low as well. 
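
One way to act on this is to score each candidate rule by its training error plus the complexity penalty and keep the rule with the smallest score. The sketch below assumes the penalty has the form $\sqrt{(\mathrm{size}(h)\ln 4 + \ln(2/\delta)) / (2|S|)}$, obtained from the rearranged bound of Theorem 5.6 with $|H_i| \le 2^i$ and $\delta_i = \delta/2^i$. The candidate rules, their error rates, and their description lengths are made-up numbers.

```python
import math

def regularized_bound(train_error, size_bits, n, delta):
    """Training error plus the complexity penalty for a rule of the given size."""
    penalty = math.sqrt((size_bits * math.log(4) + math.log(2 / delta)) / (2 * n))
    return train_error + penalty

# Hypothetical candidates: (name, training error, description length in bits).
candidates = [("simple", 0.20, 50), ("medium", 0.15, 400), ("complex", 0.10, 5000)]
n, delta = 100_000, 0.05

scored = [(regularized_bound(err, bits, n, delta), name)
          for name, err, bits in candidates]
best_bound, best_name = min(scored)
for bound, name in scored:
    print(f"{name}: bound on true error = {bound:.3f}")
print("selected:", best_name)
```

With these made-up numbers the middle candidate has the smallest bound: the simplest rule pays too much in training error and the most complex one pays too much in penalty.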


To see how this works, consider a complex network made up of many threshold
logic units. It turns out one can give a description language such that networks with
many weights set to zero have smaller size than networks with fewer weights set to zero.
Thus, if we use a regularizer that penalizes weights that are non-zero in the training, the
training will change the size of the network by setting some weights to zero. Given a
very simple network with training error 20% and possibly true error 21% or 22%, and a
complex network with training error 10% but true error 21% or 22%, we can train with
the regularizer and maybe get training error 15% and true error 18%.


5.5 VC-Dimension 


In Section 5.4.1 we showed that if the training set S is large compared to $\frac{1}{\epsilon}\log(|H|)$,
we can be confident that every h in H with true error greater than or equal to $\epsilon$ will
have training error greater than zero. If S is large compared to $\frac{1}{\epsilon^2}\log(|H|)$, then we can
be confident that every h in H will have $|\mathrm{err}_D(h) - \mathrm{err}_S(h)| \le \epsilon$. These results used
log(|H|) as a measure of complexity of the concept class H, which required that H be a
finite set. VC-dimension is a tighter measure of complexity for a concept class and also
yields confidence bounds. For any class H, VC-dim(H) $\le \log_2(|H|)$,$^{21}$ but it can also be
quite a bit smaller and is finite in some cases where H is infinite.


This issue of how big a sample is required comes up in many applications besides
learning theory. One such application is how large a sample of a data set we need
to ensure that using the sample to answer questions gives a reliable answer with high
probability. The answer relies on the complexity of the class of questions, which in some
sense corresponds to how sophisticated the learning algorithm is. To introduce the
concept of VC-dimension we will consider sampling a database instead of training a network.


Consider a database consisting of the salary and age for a random sample of the adult 
population in the United States. Suppose we are interested in using the database to an- 
swer questions of the form: “what fraction of the adult population in the United States 
has age between 35 and 45 and salary between $50,000 and $70,000?” If the data is plot- 
ted in 2-dimensions, we are interested in queries that ask about the fraction of the adult 
population within some axis-parallel rectangle. To answer a query, calculate the fraction
of the database satisfying the query. This brings up the question, how large does our
database need to be so that with probability greater than or equal to $1 - \delta$, our answer
will be within $\pm\epsilon$ of the truth for every possible rectangle query of this form?

$^{21}$ The definition of VC-dimension is that if VCdim(H) = d then there exist d points $x_1, \ldots, x_d$ such that
all ways of labeling them are achievable using hypotheses in H. For each way of labeling them, choose
some hypothesis $h_i$ in H that agrees with that labeling. Notice that the hypotheses $h_1, h_2, \ldots, h_{2^d}$ must
all be distinct, because they disagree on their labeling of $x_1, \ldots, x_d$. Therefore, $|H| \ge 2^d$. This means that
$\log_2(|H|) \ge d$.


If we assume our values are discretized such as 100 possible ages and 1,000 possible
salaries, then there are at most $(100 \times 1{,}000)^2 = 10^{10}$ possible rectangles. This means we
can apply Theorem 5.6 with $|H| \le 10^{10}$ and a sample size of $\frac{1}{2\epsilon^2}(10 \ln 10 + \ln(2/\delta))$ would
be sufficient.
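
For concreteness, the sufficient sample size for the discretized rectangle queries can be tabulated for a few accuracy levels. The sketch below assumes the uniform convergence bound in the form $n \ge (\ln|H| + \ln(2/\delta)) / (2\epsilon^2)$ with $|H| \le 10^{10}$; the function name and the choice $\delta = 0.05$ are illustrative.

```python
import math

def uniform_convergence_sample_size(log_H, epsilon, delta):
    """Smallest integer n with n >= (ln|H| + ln(2/delta)) / (2 epsilon^2)."""
    return math.ceil((log_H + math.log(2 / delta)) / (2 * epsilon**2))

# At most 10^10 axis-parallel rectangles on the discretized age/salary grid.
log_H = 10 * math.log(10)
sizes = {eps: uniform_convergence_sample_size(log_H, eps, delta=0.05)
         for eps in (0.1, 0.05, 0.01)}
for eps, n in sizes.items():
    print(f"epsilon = {eps}: n = {n}")
```

Halving $\epsilon$ roughly quadruples the required sample, whereas squaring the number of rectangles would only double the $\ln|H|$ term.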


If there are only n adults in the United States there are at most $n^4$ rectangles that are
truly different and so we could use $|H| \le n^4$. Still, this suggests that S needs to grow with
n, albeit logarithmically, and one might wonder if that is really necessary. VC-dimension, 
and the notion of the growth function of concept class H, will give us a way to avoid 
such discretization and avoid any dependence on the size of the support of the underlying 
distribution D. 


5.5.1 Definitions and Key Theorems 
Definition 5.1 A set system (X, H) consists of a set X and a class H of subsets of X. 


In learning theory, the set X is the instance space, such as all possible emails, and H 
is the class of potential hypotheses, where a hypothesis h is a subset of X, such as the 
emails that our algorithm chooses to classify as spam. 


An important concept in set systems is shattering. 


Definition 5.2 A set system (X, H) shatters a set A if each subset of A can be expressed
as $A \cap h$ for some h in H.


Definition 5.3 The VC-dimension of H is the size of the largest set shattered by H. 


For instance, there exist sets of four points in the plane that can be shattered by 
rectangles with axis-parallel edges, e.g., four points at the vertices of a diamond (see 
Figure 5.3). For each of the 16 subsets of the four points, there exists a rectangle with the 
points of the subset inside the rectangle and the remaining points outside the rectangle. 
However, rectangles with axis-parallel edges cannot shatter any set of five points. Assume 
for contradiction that there is a set of five points shattered by the family of axis-parallel 
rectangles. Start with an enclosing rectangle for the five points. Move parallel edges 
towards each other without crossing any point until each edge is stopped by at least one 
point. Identify one such point for each edge. The same point may be identified as stopping 
two edges if it is at a corner of the minimum enclosing rectangle. If two or more points 
have stopped an edge, designate only one as having stopped the edge. Now, at most four 
points have been designated. Any rectangle enclosing the designated points must include 
the undesignated points. Thus, the subset of designated points cannot be expressed as 
the intersection of a rectangle with the five points. Therefore, the VC-dimension of
axis-parallel rectangles is four.

Figure 5.3: (a) shows a set of four points that can be shattered by rectangles along with
some of the rectangles that shatter the set. Not every set of four points can be shattered 
as seen in (b). Any rectangle containing points A, B, and C must contain D. No set of five 
points can be shattered by rectangles with axis-parallel edges. No set of three collinear 
points can be shattered, since any rectangle that contains the two end points must also 
contain the middle point. More generally, since rectangles are convex, a set with one point 
inside the convex hull of the others cannot be shattered. 
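The rectangle argument can be checked by brute force on small point sets. The sketch below uses the fact that a subset is cut out by some axis-parallel rectangle exactly when its bounding box contains no other point of the set; the function name and the example point sets are illustrative.

```python
from itertools import combinations

def shattered_by_rectangles(points):
    """True if every subset of `points` (distinct 2-D points) is cut out by
    some axis-parallel rectangle."""
    for r in range(1, len(points) + 1):
        for subset in combinations(points, r):
            xs = [p[0] for p in subset]
            ys = [p[1] for p in subset]
            lo_x, hi_x, lo_y, hi_y = min(xs), max(xs), min(ys), max(ys)
            # The bounding box is the smallest rectangle containing the subset,
            # so the subset is achievable iff no other point falls inside it.
            for p in points:
                if p not in subset and lo_x <= p[0] <= hi_x and lo_y <= p[1] <= hi_y:
                    return False
    # The empty subset is always achievable by a rectangle away from all points.
    return True

diamond = [(0, 1), (1, 0), (0, -1), (-1, 0)]
collinear = [(0, 0), (1, 1), (2, 2)]
five = diamond + [(0, 0)]

print(shattered_by_rectangles(diamond))    # four points at the vertices of a diamond
print(shattered_by_rectangles(collinear))  # three collinear points
print(shattered_by_rectangles(five))       # any five points fail
```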


5.5.2 VC-Dimension of Some Set Systems 


Rectangles with axis-parallel edges 
As we saw above, the class of axis-parallel rectangles in the plane has VC-dimension four.


Intervals of the reals 

Intervals on the real line can shatter any set of two points but no set of three points 
since the subset of the first and last points cannot be isolated. Thus, the VC-dimension 
of intervals is two. 


Pairs of intervals of the reals 

Consider the family of pairs of intervals, where a pair of intervals is viewed as the set 
of points that are in at least one of the intervals. There exists a set of size four that can 
be shattered but no set of size five since the subset of first, third, and last point cannot 
be isolated. Thus, the VC-dimension of pairs of intervals is four. 
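The two interval examples can also be checked exhaustively. The sketch below relies on an observation not spelled out in the text: with the points sorted, a labeling is realizable by a union of at most k intervals exactly when the positively labeled points form at most k contiguous blocks, so only the number of points matters. Function names are illustrative.

```python
from itertools import product

def runs_of_ones(labels):
    """Number of maximal blocks of consecutive 1s in a 0/1 labeling."""
    return sum(1 for i, b in enumerate(labels)
               if b == 1 and (i == 0 or labels[i - 1] == 0))

def shattered_by_k_intervals(num_points, k):
    """Can unions of at most k intervals realize every labeling of
    num_points distinct reals (taken in sorted order)?"""
    return all(runs_of_ones(lab) <= k
               for lab in product((0, 1), repeat=num_points))

print(shattered_by_k_intervals(2, 1), shattered_by_k_intervals(3, 1))  # single intervals
print(shattered_by_k_intervals(4, 2), shattered_by_k_intervals(5, 2))  # pairs of intervals
```

The labelings that fail are the alternating ones, such as first, third, and last of five points, which need three blocks.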


Finite sets 
The system of finite sets of real numbers can shatter any finite set of real numbers 
and thus the VC-dimension of finite sets is infinite. 


Convex polygons 

For any positive integer n, place n points on the unit circle. Any subset of the points
forms the vertices of a convex polygon. Clearly that polygon does not contain any of the
points not in the subset. This shows that convex polygons can shatter arbitrarily large
sets, so the VC-dimension is infinite.

Halfspaces in d-dimensions

Define a halfspace to be the set of all points on one side of a linear separator, i.e., a
set of the form $\{x \mid w^T x \ge t\}$. The VC-dimension of halfspaces in d-dimensions is d + 1.


There exists a set of size d + 1 that can be shattered by halfspaces. Select the d
unit-coordinate vectors plus the origin to be the d + 1 points. Suppose A is any subset of these
d + 1 points. Without loss of generality assume that the origin is in A. Take a 0-1 vector
w which has 1's precisely in the coordinates corresponding to vectors not in A. Clearly
A lies in the halfspace $w^T x \le 0$ and the complement of A lies in the complementary
halfspace.
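The construction can be verified mechanically for small d. In the sketch below, index 0 stands for the origin and index i for the unit vector $e_i$; the case where the origin is not in A is handled by applying the same idea to the complement, which is our reading of the "without loss of generality" step. All names are illustrative.

```python
from itertools import product

def halfspace_for(d, subset):
    """Return (w, t) such that, among the origin and the d unit vectors,
    exactly the points indexed by `subset` satisfy w . x >= t."""
    if 0 in subset:
        # Text's construction: w has 1s on coordinates of unit vectors NOT in A,
        # so A lies in w.x <= 0; here rewritten as (-w).x >= 0.
        w = [0 if i in subset else -1 for i in range(1, d + 1)]
        return w, 0
    # Origin outside A: apply the construction to the complement and flip sides.
    w = [1 if i in subset else 0 for i in range(1, d + 1)]
    return w, 1

def halfspaces_shatter(d):
    origin = [0] * d
    units = [[1 if j == i else 0 for j in range(d)] for i in range(d)]
    pts = [origin] + units
    for mask in product((False, True), repeat=d + 1):
        subset = {i for i, keep in enumerate(mask) if keep}
        w, t = halfspace_for(d, subset)
        for i, x in enumerate(pts):
            inside = sum(wi * xi for wi, xi in zip(w, x)) >= t
            if inside != (i in subset):
                return False
    return True

print([halfspaces_shatter(d) for d in range(1, 6)])
```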


We now show that no set of d + 2 points in d-dimensions can be shattered by halfspaces.
This is done by proving that any set of d + 2 points can be partitioned into
two disjoint subsets A and B whose convex hulls intersect. This establishes the claim
since any linear separator with A on one side must have its entire convex hull on that
side,$^{22}$ so it is not possible to have a linear separator with A on one side and B on the other.


Let convex(S) denote the convex hull of the point set S. 


Theorem 5.9 (Radon): Any set $S \subseteq R^d$ with $|S| \ge d + 2$ can be partitioned into two
disjoint subsets A and B such that $\text{convex}(A) \cap \text{convex}(B) \ne \emptyset$.


Proof: Without loss of generality, assume $|S| = d + 2$. Form a $d \times (d + 2)$ matrix
with one column for each point of S. Call the matrix A. Add an extra row of all
1's to construct a $(d + 1) \times (d + 2)$ matrix B. Clearly the rank of this matrix is at
most d + 1 and the columns are linearly dependent. Say $x = (x_1, x_2, \ldots, x_{d+2})$ is a
non-zero vector with $Bx = 0$. Reorder the columns so that $x_1, x_2, \ldots, x_s \ge 0$ and
$x_{s+1}, x_{s+2}, \ldots, x_{d+2} < 0$. Normalize x so $\sum_{i=1}^{s} |x_i| = 1$. Let $b_i$ (respectively $a_i$) be the
$i^{th}$ column of B (respectively A). Then

$$\sum_{i=1}^{s} |x_i| b_i = \sum_{i=s+1}^{d+2} |x_i| b_i,$$

from which it follows that

$$\sum_{i=1}^{s} |x_i| a_i = \sum_{i=s+1}^{d+2} |x_i| a_i \quad\text{and}\quad \sum_{i=1}^{s} |x_i| = \sum_{i=s+1}^{d+2} |x_i|.$$

Since $\sum_{i=1}^{s} |x_i| = 1$ and $\sum_{i=s+1}^{d+2} |x_i| = 1$, each side of
$\sum_{i=1}^{s} |x_i| a_i = \sum_{i=s+1}^{d+2} |x_i| a_i$ is a convex combination of columns of A, which proves the
theorem. Thus, S can be partitioned into two sets, the first consisting of the first s points
after the rearrangement and the second consisting of points s + 1 through d + 2. Their
convex hulls intersect as required. ∎
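The proof is constructive, and the following sketch mirrors it: build the matrix B, find a non-zero null vector by Gaussian elimination over exact fractions, and split the points by the sign of their coefficient. The helper names are illustrative, and points with a zero coefficient are placed on the negative side, which is a harmless choice made here.

```python
from fractions import Fraction

def null_vector(rows):
    """A non-zero x with rows * x = 0, for a matrix with more columns than rows."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(m[0])
    pivots = []                      # pivot column of each reduced row, in order
    r = 0
    for c in range(n_cols):
        p = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if p is None:
            continue
        m[r], m[p] = m[p], m[r]
        m[r] = [v / m[r][c] for v in m[r]]
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                m[i] = [a - m[i][c] * b for a, b in zip(m[i], m[r])]
        pivots.append(c)
        r += 1
        if r == len(m):
            break
    free = next(c for c in range(n_cols) if c not in pivots)
    x = [Fraction(0)] * n_cols
    x[free] = Fraction(1)
    for row, c in zip(m, pivots):
        x[c] = -row[free]
    return x

def radon_partition(points):
    """points: d+2 points in R^d. Returns (A, B, point common to both hulls)."""
    d = len(points[0])
    # Matrix B of the proof: one column per point, plus a row of all 1s.
    rows = [[p[j] for p in points] for j in range(d)] + [[1] * len(points)]
    x = null_vector(rows)
    pos = [i for i, v in enumerate(x) if v > 0]
    neg = [i for i, v in enumerate(x) if v <= 0]
    total = sum(x[i] for i in pos)   # equals the sum of |x_i| over the negatives
    common = [sum(x[i] * points[i][j] for i in pos) / total for j in range(d)]
    return [points[i] for i in pos], [points[i] for i in neg], common

A, B, c = radon_partition([(0, 0), (2, 0), (0, 2), (2, 2)])
print(A, B, c)
```

For the four corners of a square the partition pairs opposite corners, and the common point is the center where the diagonals cross.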


Radon’s theorem immediately implies that half-spaces in d-dimensions do not shatter 
any set of d + 2 points. 





$^{22}$ If any two points $x_1$ and $x_2$ lie on the same side of a linear separator, so must any convex combination.
If $w \cdot x_1 \ge b$ and $w \cdot x_2 \ge b$ then $w \cdot (a x_1 + (1 - a) x_2) \ge b$.

Spheres in d-dimensions

A sphere in d-dimensions is a set of points of the form $\{x \mid |x - x_0| \le r\}$. The
VC-dimension of spheres is d + 1. It is the same as that of halfspaces. First, we prove that no
set of d + 2 points can be shattered by spheres. Suppose some set S with d + 2 points can
be shattered. Then for any partition $A_1$ and $A_2$ of S, there are spheres $B_1$ and $B_2$ such
that $B_1 \cap S = A_1$ and $B_2 \cap S = A_2$. Now $B_1$ and $B_2$ may intersect, but there is no point
of S in their intersection. It is easy to see that there is a hyperplane perpendicular to
the line joining the centers of the two spheres with all of $A_1$ on one side and all of $A_2$ on
the other and this implies that halfspaces shatter S, a contradiction. Therefore no d + 2
points can be shattered by spheres.


It is also not difficult to see that the set of d + 1 points consisting of the unit-coordinate
vectors and the origin can be shattered by spheres. Suppose A is a subset of the d + 1
points. Let a be the number of unit vectors in A. The center $a_0$ of our sphere will be
the sum of the vectors in A. For every unit vector in A, its distance to this center will
be $\sqrt{a - 1}$ and for every unit vector outside A, its distance to this center will be $\sqrt{a + 1}$.
The distance of the origin to the center is $\sqrt{a}$. Thus, we can choose the radius so that
precisely the points in A are in the hypersphere.
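A small check of this construction follows. The center is the sum of the chosen unit vectors, and the radius is taken midway between the relevant pair of distances among $\sqrt{a-1}$, $\sqrt{a}$, and $\sqrt{a+1}$. The handling of the empty subset (a small ball placed away from all points) is our own addition, since the construction's center would coincide with the origin in that case; names are illustrative.

```python
import math
from itertools import product

def sphere_for(d, origin_in, units_in):
    """Center and radius of a ball {x : |x - c| <= r} containing exactly the
    chosen points among the origin and the unit vectors e_0, ..., e_{d-1}."""
    a = len(units_in)
    if a == 0 and not origin_in:
        # Empty subset: place a small ball far from all d+1 points (our choice).
        return [-2.0] * d, 0.5
    center = [1.0 if i in units_in else 0.0 for i in range(d)]
    if origin_in:
        # Keep the origin (distance sqrt(a)), exclude others (distance sqrt(a+1)).
        r = (math.sqrt(a) + math.sqrt(a + 1)) / 2
    else:
        # Keep chosen unit vectors (sqrt(a-1)), exclude the origin (sqrt(a)).
        r = (math.sqrt(a - 1) + math.sqrt(a)) / 2
    return center, r

def spheres_shatter(d):
    pts = [[0.0] * d] + [[1.0 if j == i else 0.0 for j in range(d)] for i in range(d)]
    for mask in product((False, True), repeat=d + 1):
        units_in = {i for i in range(d) if mask[i + 1]}
        c, r = sphere_for(d, mask[0], units_in)
        for idx, p in enumerate(pts):
            if (math.dist(p, c) <= r) != mask[idx]:
                return False
    return True

print([spheres_shatter(d) for d in range(1, 6)])
```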


5.5.3 Shatter Function for Set Systems of Bounded VC-Dimension 


For a set system (X, H), the shatter function $\pi_H(n)$ is the maximum number of subsets
of any set A of size n that can be expressed as $A \cap h$ for h in H. The function $\pi_H(n)$
equals $2^n$ for n less than or equal to the VC-dimension of H. We will soon see that
for n greater than the VC-dimension of H, $\pi_H(n)$ grows polynomially with n, with the
polynomial degree equal to the VC-dimension. Define

$$\binom{n}{\le d} = \binom{n}{0} + \binom{n}{1} + \cdots + \binom{n}{d}$$

to be the number of ways of choosing d or fewer elements out of n. Note that $\binom{n}{\le d} \le n^d + 1$.$^{23}$
We will show that for any set system (X, H) of VC-dimension d, $\pi_H(n) \le \binom{n}{\le d}$. That
is, $\binom{n}{\le d}$ bounds the number of subsets of any n point set A that can be expressed as the
intersection of A with a set of H. Thus, the shatter function $\pi_H(n)$ is either $2^n$ if d is
infinite or it is bounded by a polynomial of degree d.
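To get a feel for these quantities, the sketch below tabulates the number of ways of choosing d or fewer elements out of n next to $2^n$ and $n^d + 1$ for a fixed d. The function name `binom_le` and the choice d = 3 are illustrative.

```python
from math import comb

def binom_le(n, d):
    """Number of ways of choosing d or fewer elements out of n."""
    return sum(comb(n, i) for i in range(d + 1))

d = 3
print("n  2^n  binom(n,<=d)  n^d+1")
for n in (1, 2, 3, 4, 6, 10, 20):
    print(n, 2 ** n, binom_le(n, d), n ** d + 1)
```

For n at most d the count equals $2^n$; beyond that it falls behind $2^n$ and stays below the polynomial $n^d + 1$.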





Lemma 5.10 (Sauer) For any set system (X, H) of VC-dimension at most d, $\pi_H(n) \le \binom{n}{\le d}$
for all n.


Proof: First, note the following identity. For $n \ge 1$ and $d \ge 1$,

$$\binom{n}{\le d} = \binom{n-1}{\le d-1} + \binom{n-1}{\le d}.$$

This is because to choose d or fewer elements out of n, we can either choose the first
element, and then choose $d - 1$ or fewer out of the remaining $n - 1$, or not choose the first
element, and then choose d or fewer out of the remaining $n - 1$. The proof of the lemma
is by induction on n and d. In particular, the base case will handle all pairs (n, d) with
either $n \le d$ or $d = 0$, and the general case (n, d) will use the inductive assumption on
the cases $(n - 1, d - 1)$ and $(n - 1, d)$.

$^{23}$ To choose between 1 and d elements out of n, for each element there are n possible items if we allow
duplicates. Thus $\binom{n}{1} + \binom{n}{2} + \cdots + \binom{n}{d} \le n^d$. The +1 is for $\binom{n}{0}$.

For the base case $n \le d$, $\binom{n}{\le d} = 2^n$ and $\pi_H(n) \le 2^n$. For the base case VC-dimension
d = 0, a set system (X, H) can have at most one set in H since if there were two sets in
H there would exist a set A consisting of a single element that was contained in one of
the sets but not in the other and thus could be shattered. Therefore, for d = 0, we have
$\pi_H(n) = 1 = \binom{n}{\le 0}$.


Consider the case for general d and n. Select a subset A of X of size n such that $\pi_H(n)$
subsets of A can be expressed as $A \cap h$ for h in H. Without loss of generality, assume that
X = A and replace each set $h \in H$ by $h \cap A$, removing duplicate sets; i.e., if $h_1 \cap A = h_2 \cap A$
for $h_1$ and $h_2$ in H, keep only one of them. Now each set in H corresponds to a subset of
A and $\pi_H(n) = |H|$. To show $\pi_H(n) \le \binom{n}{\le d}$, we only need to show $|H| \le \binom{n}{\le d}$.


Remove some element u from the set A and from each set in H. Consider the set
system $H_1 = (A - \{u\}, \{h - \{u\} \mid h \in H\})$. For $h \subseteq A - \{u\}$, if exactly one of h and
$h \cup \{u\}$ is in H, then the set h contributes one set to both H and $H_1$, whereas, if both
h and $h \cup \{u\}$ are in H, then they together contribute two sets to H, but only one to
$H_1$. Thus $|H_1|$ is less than $|H|$ by the number of pairs of sets in H that differ only in the
element u. To account for this difference, define another set system

$$H_2 = (A - \{u\}, \{h \mid u \notin h \text{ and both } h \text{ and } h \cup \{u\} \text{ are in } H\}).$$

Then

$$|H| = |H_1| + |H_2| = \pi_{H_1}(n - 1) + \pi_{H_2}(n - 1),$$

that is,

$$\pi_H(n) = \pi_{H_1}(n - 1) + \pi_{H_2}(n - 1).$$


We make use of two facts

(1) $H_1$ has dimension at most d, and

(2) $H_2$ has dimension at most $d - 1$.

(1) follows because if $H_1$ shatters a set of cardinality d + 1, then H also would shatter
that set producing a contradiction. (2) follows because if $H_2$ shattered a set $B \subseteq A - \{u\}$
with $|B| \ge d$, then $B \cup \{u\}$ would be shattered by H where $|B \cup \{u\}| \ge d + 1$, again
producing a contradiction.

By the induction hypothesis applied to $H_1$, $|H_1| = \pi_{H_1}(n - 1) \le \binom{n-1}{\le d}$. By the
induction hypothesis applied to $H_2$, $|H_2| = \pi_{H_2}(n - 1) \le \binom{n-1}{\le d-1}$. Finally, by the identity
at the beginning of the proof, we have

$$\pi_H(n) = \pi_{H_1}(n - 1) + \pi_{H_2}(n - 1) \le \binom{n-1}{\le d} + \binom{n-1}{\le d-1} = \binom{n}{\le d}$$

as desired. ∎
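Sauer's bound is attained by some classes. As a hedged illustration (this example is ours, not the text's), intervals on the line have VC-dimension two, and the number of distinct subsets of n collinear points that intervals can cut out turns out to equal the count of ways to choose two or fewer elements out of n. The sketch also spot-checks the identity used at the start of the proof; names are illustrative.

```python
from math import comb

def binom_le(n, d):
    """Number of ways of choosing d or fewer elements out of n."""
    return sum(comb(n, i) for i in range(d + 1))

def interval_traces(n):
    """Distinct subsets of n sorted points on a line cut out by intervals.
    Points are represented by their indices 0..n-1."""
    traces = {frozenset()}                           # an interval missing every point
    for i in range(n):
        for j in range(i, n):
            traces.add(frozenset(range(i, j + 1)))   # points i..j, one contiguous block
    return traces

for n in range(1, 9):
    print(n, len(interval_traces(n)), binom_le(n, 2))
```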


5.5.4 VC-Dimension of Combinations of Concepts 


Often one wants to create concepts out of other concepts. For example, given several 
linear separators, one could take their intersection to create a convex polytope. Or given 
several disjunctions, one might want to take their majority vote. We can use Sauer’s 
lemma, (Lemma 5.10), to show that such combinations do not increase the VC-dimension 
of the class by too much. 


Let $(X, H_1)$ and $(X, H_2)$ be two set systems on the same underlying set X. Define
another set system, called the intersection system, $(X, H_1 \cap H_2)$, where
$H_1 \cap H_2 = \{h_1 \cap h_2 \mid h_1 \in H_1, h_2 \in H_2\}$. In other words, take the intersections of every set in $H_1$ with
every set in $H_2$. A simple example is $X = R^d$ with $H_1$ and $H_2$ both the set of all
halfspaces. Then $H_1 \cap H_2$ consists of all sets defined by the intersection of two halfspaces.
This corresponds to taking the Boolean AND of the output of two threshold gates and
is the most basic neural net besides a single gate. We can repeat this process and take
the intersection of k halfspaces. The following simple lemma bounds the growth of the
shatter function as we do this.


Lemma 5.11 Suppose $(X, H_1)$ and $(X, H_2)$ are two set systems on the same set X. Then
$\pi_{H_1 \cap H_2}(n) \le \pi_{H_1}(n)\,\pi_{H_2}(n)$.


Proof: Let $A \subseteq X$ be a set of size n. We are interested in the size of
$S = \{A \cap h \mid h \in H_1 \cap H_2\}$. By definition of $H_1 \cap H_2$, we have $S = \{A \cap (h_1 \cap h_2) \mid h_1 \in H_1, h_2 \in H_2\}$,
which we can rewrite as $S = \{(A \cap h_1) \cap (A \cap h_2) \mid h_1 \in H_1, h_2 \in H_2\}$. Therefore,
$|S| \le |\{A \cap h_1 \mid h_1 \in H_1\}| \times |\{A \cap h_2 \mid h_2 \in H_2\}|$, as desired. ∎


We can generalize the idea of an intersection system to other ways of combining con- 
cepts. This will be useful later for our discussion of Boosting in Section 5.11 where we 
will be combining hypotheses via majority vote. 


Specifically, given k concepts $h_1, h_2, \ldots, h_k$ and a Boolean function f, define the set
$\text{comb}_f(h_1, \ldots, h_k) = \{x \in X \mid f(h_1(x), \ldots, h_k(x)) = 1\}$, where here we are using $h_i(x)$ to
denote the indicator for whether or not $x \in h_i$. For example, if f is the AND function
then $\text{comb}_f(h_1, \ldots, h_k)$ is the intersection of the $h_i$, or if f is the majority-vote function,
then $\text{comb}_f(h_1, \ldots, h_k)$ is the set of points that lie in more than half of the sets $h_i$. The
concept $\text{comb}_f(h_1, \ldots, h_k)$ can also be viewed as a depth-two neural network. Given a
concept class H, a Boolean function f, and an integer k, define the new concept class
$COMB_{f,k}(H) = \{\text{comb}_f(h_1, \ldots, h_k) \mid h_i \in H\}$. The same reasoning used to prove Lemma
5.11 also gives us the following lemma.


Lemma 5.12 For any Boolean function f, hypothesis class H, and integer k,

$$\pi_{COMB_{f,k}(H)}(n) \le \pi_H(n)^k.$$

We can now use Lemma 5.12 to prove the following theorem about the VC-dimension of 
hypothesis classes defined by combining other hypothesis classes. 


Theorem 5.13 If concept class H has VC-dimension V then for any Boolean function
f and integer k, the class $COMB_{f,k}(H)$ has VC-dimension $O(kV \log(kV))$.


Proof: Let n be the VC-dimension of $COMB_{f,k}(H)$, so by definition, there must exist
a set S of n points shattered by $COMB_{f,k}(H)$. We know by Sauer's lemma that there
are at most $n^V$ ways of partitioning the points in S using sets in H. Since each set in
$COMB_{f,k}(H)$ is determined by k sets in H, and there are at most $(n^V)^k = n^{kV}$ different
k-tuples of such sets, this means there are at most $n^{kV}$ ways of partitioning the points
using sets in $COMB_{f,k}(H)$. Since S is shattered, we must have $2^n \le n^{kV}$, or equivalently
$n \le kV \log_2(n)$. We solve this as follows. First, assuming $n \ge 16$ we have $\log_2(n) \le \sqrt{n}$
so $kV \log_2(n) \le kV \sqrt{n}$ which implies that $n \le (kV)^2$. To get the better bound, plug
back into the original inequality. Since $n \le (kV)^2$, it must be that $\log_2(n) \le 2\log_2(kV)$.
Substituting $\log_2 n \le 2\log_2(kV)$ into $n \le kV \log_2 n$ gives $n \le 2kV \log_2(kV)$. ∎
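The last step of the proof can be sanity-checked numerically: for a given product kV, search directly for the largest n with $2^n \le n^{kV}$ and compare it with $2kV \log_2(kV)$. The sketch below uses exact integer arithmetic for the search; the function name and the search limit are our own choices.

```python
import math

def largest_n(kV):
    """Largest n >= 1 with 2^n <= n^(kV), or 0 if there is none."""
    limit = 4 * kV * (kV.bit_length() + 1) + 16   # generous search range (our choice)
    best = 0
    for n in range(1, limit + 1):
        if 2 ** n <= n ** kV:
            best = n
    return best

for kV in (2, 4, 10, 20, 40):
    print(kV, largest_n(kV), round(2 * kV * math.log2(kV), 1))
```

For kV = 2 and kV = 4 the two numbers coincide, so the constant 2 in the bound cannot be improved in general.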


5.5.5 The Key Theorem 


Theorem 5.14 Let (X, H) be a set system, D a probability distribution over X, and let
n be an integer satisfying

$$n \ge \frac{2}{\epsilon}\left[\log_2 2\pi_H(2n) + \log_2 \frac{1}{\delta}\right].$$

Let $S_1$ consist of n points drawn from D. With probability greater than or equal to $1 - \delta$,
every set in H of probability mass greater than $\epsilon$ intersects $S_1$.


Note: n occurs on both sides of the inequality above. If H has finite VC-dimension d,
this does not lead to circularity since, by Lemma 5.10, $\log(\pi_H(2n)) = O(d \log n)$ and an
inequality of the form $n \ge a \log n$ (for a positive integer $a \ge 4$) is implied by $n \ge c\,a \ln a$
for some constant c, thus eliminating n from the right hand side.
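To see that the condition on n is easy to satisfy in practice, the sketch below searches for the smallest n meeting $n \ge \frac{2}{\epsilon}\left[\log_2 2\pi_H(2n) + \log_2 \frac{1}{\delta}\right]$, with $\pi_H(2n)$ replaced by Sauer's bound for a class of VC-dimension d. The function names and parameter values are illustrative, and using Sauer's bound in place of the true shatter function is an assumption that can only make the answer larger.

```python
import math
from math import comb

def sauer_bound(n, d):
    """Upper bound on the shatter function: ways to choose d or fewer out of n."""
    return sum(comb(n, i) for i in range(d + 1))

def bound_rhs(n, eps, delta, d):
    """Right-hand side of the condition, with pi_H(2n) replaced by Sauer's bound."""
    return (2 / eps) * (math.log2(2 * sauer_bound(2 * n, d)) + math.log2(1 / delta))

def required_n(eps, delta, d):
    """Smallest n with n >= bound_rhs(n), found by a linear scan."""
    n = 1
    while n < bound_rhs(n, eps, delta, d):
        n += 1
    return n

n_star = required_n(eps=0.1, delta=0.05, d=4)
print(n_star)
```

The scan terminates because the right-hand side grows only logarithmically in n.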


Proof: Let A be the event that there exists a set h in H of probability mass greater than
or equal to $\epsilon$ that is disjoint from $S_1$. Draw a second set $S_2$ of n points from D. Let B be
the event that there exists h in H that is disjoint from $S_1$ but that contains at least $\frac{\epsilon}{2} n$
points in $S_2$. That is,

B: there exists a set $h \in H$ with $|S_1 \cap h| = 0$ but $|S_2 \cap h| \ge \frac{\epsilon}{2} n$.

By Chebyshev, $\text{Prob}(B|A) \ge \frac{1}{2}$. In particular, if h is disjoint from $S_1$ and has probability
mass greater than or equal to $\epsilon$, there is at least a $\frac{1}{2}$ chance that h will contain at least
an $\frac{\epsilon}{2}$ fraction of the points in a new random set $S_2$. This means that

$$\text{Prob}(B) \ge \text{Prob}(A, B) = \text{Prob}(B|A)\,\text{Prob}(A) \ge \frac{1}{2}\,\text{Prob}(A).$$


Therefore, to prove that $\text{Prob}(A) \le \delta$ it suffices to prove that $\text{Prob}(B) \le \frac{\delta}{2}$. For this, we
consider a second way of picking $S_1$ and $S_2$. Draw a random set $S_3$ of 2n points from
D, and then randomly partition $S_3$ into two equal pieces; let $S_1$ be the first piece and $S_2$
the second. It is obvious that this yields the same probability distribution for $S_1$ and $S_2$
as picking each independently.


Now, consider the point in time after $S_3$ has been drawn but before it has been randomly
partitioned into $S_1$ and $S_2$. Even though H may contain infinitely many sets, we
know it has at most $\pi_H(2n)$ distinct intersections with $S_3$. That is, $|\{S_3 \cap h \mid h \in H\}| \le \pi_H(2n)$.
Thus, to prove that $\text{Prob}(B) \le \frac{\delta}{2}$, it is sufficient to prove that for any given
$h' \subseteq S_3$, the probability over the random partition of $S_3$ into $S_1$ and $S_2$ that $|S_1 \cap h'| = 0$
but $|S_2 \cap h'| \ge \frac{\epsilon}{2} n$ is at most $\frac{\delta}{2\pi_H(2n)}$.

To analyze this, first note that if h' contains fewer than $\frac{\epsilon}{2} n$ points, it is impossible to
have $|S_2 \cap h'| \ge \frac{\epsilon}{2} n$. For h' containing at least $\frac{\epsilon}{2} n$ points, the probability over the random partition of
$S_3$ that none of the points in h' fall into $S_1$ is at most $\left(\frac{1}{2}\right)^{\epsilon n/2}$. Plugging in our bound on
n in the theorem statement we get

$$2^{-\epsilon n/2} \le 2^{-\log_2 2\pi_H(2n) + \log_2 \delta} = \frac{\delta}{2\pi_H(2n)}$$

as desired. Thus, $\text{Prob}(B) \le \frac{\delta}{2}$ and $\text{Prob}(A) \le \delta$. This type of argument where we used
two ways of picking $S_1$ and $S_2$ is called “double sampling” or the “ghost sample” method.
The key idea is that we postpone certain random choices to the future until after we have
converted our problem into one of finite size. Double sampling is useful in other contexts
as well. ∎
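The key counting step, that a fixed set of m points among the 2n points of $S_3$ avoids $S_1$ with probability at most $2^{-m}$, can be checked exactly. The sketch below computes the probability as a ratio of binomial coefficients, which is our own way of writing the random-partition probability; the function name is illustrative.

```python
from fractions import Fraction
from math import comb

def prob_all_in_second_half(n, m):
    """Chance that m marked points among 2n all land in S2 when S1 is a
    uniformly random n-element subset of the 2n points."""
    # Count the choices of S1 that avoid all m marked points.
    return Fraction(comb(2 * n - m, n), comb(2 * n, n))

n = 10
for m in range(0, 8):
    print(m, float(prob_all_in_second_half(n, m)), 2.0 ** -m)
```

With $m \ge \frac{\epsilon}{2} n$ this gives the bound $2^{-\epsilon n/2}$ used in the proof.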


5.6 VC-dimension and Machine Learning 


We now apply the concept of VC-dimension to machine learning. In machine learning
we have a target concept $c^*$ such as spam emails and a set of hypotheses H which are sets
of emails we claim are spam. Let $H' = \{h \triangle c^* \mid h \in H\}$ be the collection of error regions of
hypotheses in H. Note that H' and H have the same VC-dimension and shatter function.
We now draw a training sample S of emails, and apply Theorem 5.14 to H' to argue
that with high probability, every h with $\text{Prob}(h \triangle c^*) \ge \epsilon$ has $|S \cap (h \triangle c^*)| > 0$. In other
words, with high probability, only hypotheses of low true error are fully consistent with
the training sample. This is formalized in the theorem statement below.

Theorem 5.15 (sample bound) For any class H and distribution D, if a training
sample S is drawn from D of size

$$n \ge \frac{2}{\epsilon}\left[\log(2\pi_H(2n)) + \log\frac{1}{\delta}\right]$$

then with probability greater than or equal to $1 - \delta$, every $h \in H$ with true error $err_D(h) \ge \epsilon$
has $err_S(h) > 0$. Equivalently, every $h \in H$ with training error $err_S(h) = 0$ has
$err_D(h) < \epsilon$.


Proof: The proof follows from Theorem 5.14 applied to $H' = \{h \triangle c^* \mid h \in H\}$. ∎


Theorem 5.16 (Growth function uniform convergence) For any class H and
distribution D, if a training sample S is drawn from D of size

$$n \ge \frac{8}{\epsilon^2}\left[\ln(2\pi_H(2n)) + \ln\frac{1}{\delta}\right],$$

then with probability greater than or equal to $1 - \delta$, every $h \in H$ will have $|err_S(h) - err_D(h)| \le \epsilon$.


Proof: This proof is similar to the proofs of Theorems 5.15 and 5.14. The main changes
are that B is defined to be the event that some h has intersections with $S_1$ and $S_2$ that
differ in size by $\frac{\epsilon}{2} n$, and then Hoeffding bounds are used to analyze the probability of this
occurring for a fixed h. ∎


Finally, we can apply Lemma 5.10 to write the above theorems in terms of VC-dimension 
rather than the shatter function. We do this for the case of Theorem 5.15; the case of 
Theorem 5.16 is similar. 


Corollary 5.17 For any class H and distribution D, a training sample S of size

$$O\left(\frac{1}{\epsilon}\left[\text{VCdim}(H)\log\frac{1}{\epsilon} + \log\frac{1}{\delta}\right]\right)$$

is sufficient to ensure that with probability greater than or equal to $1 - \delta$, every $h \in H$
with true error $err_D(h) \ge \epsilon$ has training error $err_S(h) > 0$. Equivalently, every $h \in H$
with $err_S(h) = 0$ has $err_D(h) < \epsilon$.


5.7 Other Measures of Complexity 


VC-dimension and number of bits needed to describe a set are not the only measures
of complexity one can use to derive generalization guarantees. There has been significant
work on a variety of measures. One measure, called Rademacher complexity, measures the
extent to which a given concept class H can fit random noise. Given a set of n examples
$S = \{x_1, \ldots, x_n\}$, the empirical Rademacher complexity of H is defined as

$$R_S(H) = E_{\sigma_1, \ldots, \sigma_n} \max_{h \in H} \frac{1}{n} \sum_{i=1}^{n} \sigma_i h(x_i),$$

where $\sigma_i \in \{-1, 1\}$ are independent random labels with $\text{Prob}[\sigma_i = 1] = \frac{1}{2}$. For example, if
you assign random ±1 labels to the points in S and the best classifier in H on average gets
error 0.45 then $R_S(H) = 0.55 - 0.45 = 0.1$. One can prove that with probability greater
than or equal to $1 - \delta$, every $h \in H$ satisfies true error less than or equal to training error
plus $R_S(H) + 3\sqrt{\frac{\ln(2/\delta)}{2n}}$. For more on results such as this see [BM02].
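For very small n the empirical Rademacher complexity can be computed exactly by enumerating all sign vectors. The sketch below assumes the standard definition, the expectation over random ±1 labels $\sigma$ of $\max_{h \in H} \frac{1}{n}\sum_i \sigma_i h(x_i)$, with hypotheses taking values in $\{-1, +1\}$; the threshold class and all names are illustrative choices.

```python
from fractions import Fraction
from itertools import product

def empirical_rademacher(hypotheses):
    """hypotheses: list of label vectors (h(x_1), ..., h(x_n)) with entries in {-1, +1}.

    Averages, over all 2^n sign vectors sigma, the best correlation
    max_h (1/n) sum_i sigma_i h(x_i). Exact, so only sensible for small n.
    """
    n = len(hypotheses[0])
    total = Fraction(0)
    for sigma in product((-1, 1), repeat=n):
        total += max(Fraction(sum(s * v for s, v in zip(sigma, h)), n)
                     for h in hypotheses)
    return total / 2 ** n

n = 6
# Thresholds on n sorted points: h_t labels point i as +1 iff i >= t.
thresholds = [tuple(1 if i >= t else -1 for i in range(n)) for t in range(n + 1)]
all_labelings = list(product((-1, 1), repeat=n))

print(float(empirical_rademacher(thresholds)))
print(float(empirical_rademacher(all_labelings)))   # a class that fits any noise
```

A class that realizes every labeling fits random noise perfectly, while a single fixed hypothesis has complexity zero.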


5.8 Deep Learning 


Deep learning, or a deep neural network, refers to training a many-layered network of 
non-linear computational units. Each computational unit or gate works as follows: there
is a set of “wires” bringing inputs to the gate. Each wire has a “weight” and the gate's
output is a real number obtained by applying a non-linear “activation function” to
the weighted sum of the input values. The activation function is generally the same for
all gates in the network, though the number of inputs to individual gates may differ.


The input to the network is an example $x \in R^d$. The first layer of the network
transforms the example into a new vector $f_1(x)$. Then the second layer transforms $f_1(x)$ into
a new vector $f_2(f_1(x))$, and so on. Finally, the $k^{th}$ layer outputs the final prediction

$$F(x) = f_k(f_{k-1}(\ldots(f_1(x)))).$$


In supervised learning, we are given training examples $x_1, x_2, \ldots$, and corresponding
labels $c^*(x_1), c^*(x_2), \ldots$. The training process finds a set of weights of all wires so as to
minimize the error: $(F(x_1) - c^*(x_1))^2 + (F(x_2) - c^*(x_2))^2 + \cdots$. One could alternatively
aim to minimize other quantities besides the sum of squared errors of training examples.
Often training is carried out by running stochastic gradient descent in the weights space.


The motivation for deep learning is that often we are interested in data, such as images, 
that are given to us in terms of very low-level features, such as pixel intensity values. Our 
goal is to achieve some higher-level understanding of each image, such as what objects 
are in the image and what they are doing. To do so, it is natural to first convert the given 
low-level representation into one of higher-level features. That is what the layers of the 
network aim to do. Deep learning is also motivated by multi-task learning, with the idea 
that a good higher-level representation of data should be useful for a wide range of tasks. 
Indeed, a common use of deep learning for multi-task learning is to share initial levels of 
the network across tasks. 


A typical architecture of a deep neural network consists of layers of logic units. In a 
fully connected layer, the output of each gate in the layer is connected to the input of 
every gate in the next layer. However, if the input is an image one might like to recognize 
features independent of where they are located in the image. To achieve this, one often 
uses a number of convolution layers. In a convolution layer, each gate gets inputs from a 
small k x k grid where k may be 5 to 10. There is a gate for each k x k square array of 
the image. The weights on each gate are tied together so that each gate recognizes the
same feature. There will be several such collections of gates, so several different features
can be learned. Such a level is called a convolution level and the fully connected layers
are called autoencoder levels. A technique called pooling is used to keep the number of
gates reasonable. A small k × k grid with k typically set to two is used to scan a layer.
The stride is set so the grid will provide a non-overlapping cover of the layer. Each k × k
input grid will be reduced to a single cell by selecting the maximum input value or the
average of the inputs. For k = 2 this reduces the number of cells by a factor of four.

Figure 5.4: Convolution layers
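The pooling step described above can be sketched in a few lines. The function below reduces each non-overlapping k × k block of a two-dimensional grid to its maximum or its average; the name `pool`, the `mode` parameter, and the choice to drop leftover rows and columns when the size is not a multiple of k are assumptions of this sketch.

```python
def pool(layer, k=2, mode="max"):
    """Reduce each non-overlapping k x k block of a 2-D grid to one cell."""
    rows, cols = len(layer), len(layer[0])
    out = []
    # A stride of k makes the blocks a non-overlapping cover of the layer.
    for r in range(0, rows - rows % k, k):
        out_row = []
        for c in range(0, cols - cols % k, k):
            block = [layer[r + i][c + j] for i in range(k) for j in range(k)]
            out_row.append(max(block) if mode == "max" else sum(block) / len(block))
        out.append(out_row)
    return out

layer = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
print(pool(layer))              # max pooling: 16 cells become 4
print(pool(layer, mode="avg"))  # average pooling
```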


Deep learning networks are trained by stochastic gradient descent (Section 5.9.1), 
sometimes called back propagation in the network context. An error function is con- 
structed and the weights are adjusted using the derivative of the error function. This 
requires that the error function be differentiable. A smooth threshold is used such as

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \quad\text{where}\quad \frac{\partial \tanh(x)}{\partial x} = 1 - \left(\frac{e^x - e^{-x}}{e^x + e^{-x}}\right)^2$$

or

$$\text{sigmoid}(x) = \frac{1}{1 + e^{-x}} \quad\text{where}\quad \frac{\partial\, \text{sigmoid}(x)}{\partial x} = \frac{e^{-x}}{(1 + e^{-x})^2} = \text{sigmoid}(x)\,\frac{e^{-x}}{1 + e^{-x}} = \text{sigmoid}(x)\,(1 - \text{sigmoid}(x)).$$

Figure 5.5: A convolution level (top) and AlexNet, consisting of five convolution levels,
followed by three fully connected levels, and then Softmax


In fact, the function

$$\text{ReLU}(x) = \max(0, x) = \begin{cases} x & x \ge 0 \\ 0 & \text{otherwise} \end{cases} \quad\text{where}\quad \frac{\partial\, \text{ReLU}(x)}{\partial x} = \begin{cases} 1 & x > 0 \\ 0 & x < 0 \end{cases}$$

seems to work well even though its derivative at x = 0 is
undefined. An advantage of ReLU over sigmoid is that ReLU does not saturate far from
the origin.
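The derivative formulas for the three activation functions can be checked against finite differences. The sketch below uses $1 - \tanh^2(x)$ for tanh, $\text{sigmoid}(x)(1 - \text{sigmoid}(x))$ for the sigmoid, and a 0/1 step for ReLU; returning 0 at x = 0 for ReLU is a convention chosen here, and all function names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x):
    return max(0.0, x)

def d_relu(x):
    # Undefined at x = 0; returning 0 there is a convention for this sketch.
    return 1.0 if x > 0 else 0.0

def numeric_derivative(f, x, h=1e-6):
    """Central finite difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

for x in (-2.0, -0.5, 0.3, 1.7):
    print(x,
          abs(numeric_derivative(math.tanh, x) - d_tanh(x)),
          abs(numeric_derivative(sigmoid, x) - d_sigmoid(x)),
          abs(numeric_derivative(relu, x) - d_relu(x)))

# Saturation: far from the origin the sigmoid's slope vanishes, ReLU's does not.
print(d_sigmoid(10.0), d_relu(10.0))
```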





The output of the hidden gates is an encoding of the input. An image might be a
$10^8$-dimensional input and there may only be $10^5$ hidden gates. However, the number of
images might be $10^7$ so even though the dimension of the hidden layer is smaller than the
dimension of the input, the number of possible codes far exceeds the number of inputs
and thus the hidden layer is a compressed representation of the input. If the hidden layer
were the same dimension as the input layer one might get the identity mapping. This
does not happen for gradient descent starting with random weights.


The output layer of a deep network typically uses a softmax procedure. Softmax is
a generalization of logistic regression. Given a set of vectors $\{x_1, x_2, \ldots, x_n\}$ with labels
$l_1, l_2, \ldots, l_n$, $l_i \in \{0, 1\}$, a weight vector w defines the probability that the label l given x
equals 0 or 1 by

$$\text{Prob}(l = 1|x) = \frac{1}{1 + e^{-w^T x}} = \sigma(w^T x)$$

and

$$\text{Prob}(l = 0|x) = 1 - \text{Prob}(l = 1|x)$$

where $\sigma$ is the sigmoid function.

Figure 5.6: A deep learning fully connected network.


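Define a cost function

$$J(w) = \sum_i \Big( l_i \log(\text{Prob}(l = 1|x_i)) + (1 - l_i) \log(1 - \text{Prob}(l = 1|x_i)) \Big) = \sum_i \Big( l_i \log(\sigma(w^T x_i)) + (1 - l_i) \log(1 - \sigma(w^T x_i)) \Big)$$

and compute w to maximize J(w); as written, J is the log-likelihood of the labels, so this is
the same as minimizing the cost $-J(w)$. Write $x_{ij}$ for the $j^{th}$ coordinate of $x_i$. Since
$\frac{\partial \sigma(w^T x_i)}{\partial w_j} = \sigma(w^T x_i)(1 - \sigma(w^T x_i))\, x_{ij}$, it follows that

$$\frac{\partial \log(\sigma(w^T x_i))}{\partial w_j} = \frac{\sigma(w^T x_i)(1 - \sigma(w^T x_i))\, x_{ij}}{\sigma(w^T x_i)} = (1 - \sigma(w^T x_i))\, x_{ij}.$$

Thus

$$\begin{aligned}
\frac{\partial J}{\partial w_j} &= \sum_i \left( l_i \frac{\sigma(w^T x_i)(1 - \sigma(w^T x_i))}{\sigma(w^T x_i)}\, x_{ij} - (1 - l_i) \frac{(1 - \sigma(w^T x_i))\,\sigma(w^T x_i)}{1 - \sigma(w^T x_i)}\, x_{ij} \right) \\
&= \sum_i \Big( l_i (1 - \sigma(w^T x_i))\, x_{ij} - (1 - l_i)\, \sigma(w^T x_i)\, x_{ij} \Big) \\
&= \sum_i \Big( l_i x_{ij} - l_i \sigma(w^T x_i) x_{ij} - \sigma(w^T x_i) x_{ij} + l_i \sigma(w^T x_i) x_{ij} \Big) \\
&= \sum_i \big( l_i - \sigma(w^T x_i) \big)\, x_{ij}.
\end{aligned}$$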
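The end result of this derivation, $\frac{\partial J}{\partial w_j} = \sum_i (l_i - \sigma(w^T x_i))\, x_{ij}$ with $x_{ij}$ the $j^{th}$ coordinate of $x_i$, can be checked against a finite-difference gradient. The sketch below does this on a tiny made-up data set; the data, the weight vector, and the function names are illustrative only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def J(w, xs, ls):
    """Sum over examples of l log(sigma(w.x)) + (1 - l) log(1 - sigma(w.x))."""
    total = 0.0
    for x, l in zip(xs, ls):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        total += l * math.log(p) + (1 - l) * math.log(1 - p)
    return total

def grad_J(w, xs, ls):
    """Closed form from the derivation: dJ/dw_j = sum_i (l_i - sigma(w.x_i)) x_ij."""
    g = [0.0] * len(w)
    for x, l in zip(xs, ls):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for j, xj in enumerate(x):
            g[j] += (l - p) * xj
    return g

# Made-up examples; the first coordinate acts as a constant bias feature.
xs = [(1.0, 2.0), (1.0, -1.0), (1.0, 0.5)]
ls = [1, 0, 1]
w = [0.1, -0.2]

h = 1e-6
for j in range(len(w)):
    w_plus = list(w); w_plus[j] += h
    w_minus = list(w); w_minus[j] -= h
    numeric = (J(w_plus, xs, ls) - J(w_minus, xs, ls)) / (2 * h)
    print(j, numeric, grad_J(w, xs, ls)[j])
```

Stepping w in the direction of this gradient increases J, which is gradient ascent on the log-likelihood.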


Softmax is a generalization of logistic regression to multiple classes. Thus, the labels
$l_i$ take on values $\{1, 2, \ldots, k\}$. For an input x, softmax estimates the probability of each
label. The hypothesis is of the form

$$h_w(x) = \begin{bmatrix} \text{Prob}(l = 1|x, w_1) \\ \text{Prob}(l = 2|x, w_2) \\ \vdots \\ \text{Prob}(l = k|x, w_k) \end{bmatrix} = \frac{1}{\sum_{i=1}^{k} e^{w_i^T x}} \begin{bmatrix} e^{w_1^T x} \\ e^{w_2^T x} \\ \vdots \\ e^{w_k^T x} \end{bmatrix}$$

where the matrix formed by the weight vectors is

$$W = (w_1, w_2, \ldots, w_k).$$

W is a matrix since for each label $l_i$, there is a vector $w_i$ of weights.


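Consider a set of n inputs $\{x_1, x_2, \ldots, x_n\}$. Define

$$\delta(l = k) = \begin{cases} 1 & \text{if } l = k \\ 0 & \text{otherwise} \end{cases}$$

and

$$J(W) = -\sum_{i=1}^{n} \sum_{j=1}^{k} \delta(l_i = j) \log \frac{e^{w_j^T x_i}}{\sum_{h=1}^{k} e^{w_h^T x_i}}.$$

The derivative of the cost function with respect to the weights is

$$\nabla_{w_k} J(W) = -\sum_{j=1}^{n} x_j \big( \delta(l_j = k) - \text{Prob}(l_j = k \mid x_j, W) \big).$$

Note $\nabla_{w_k} J(W)$ is a vector. Since $w_k$ is a vector, each component of $\nabla_{w_k} J(W)$ is the
derivative with respect to one component of the vector $w_k$.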
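A finite-difference check of the softmax gradient follows. It assumes the cost is the negative log-likelihood $J(W) = -\sum_i \log \text{Prob}(l_i \mid x_i, W)$, so that the gradient with respect to the weight vector of class c is $-\sum_i x_i\,(\delta(l_i = c) - \text{Prob}(l_i = c \mid x_i, W))$. Labels are numbered from 0 in the code for convenience, and the data and names are made up for illustration.

```python
import math

def softmax_probs(W, x):
    """W: list of k weight vectors; returns Prob(l = c | x, W) for each class c."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in W]
    m = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def cost(W, xs, ls):
    """J(W) = - sum_i log Prob(l_i | x_i, W), labels in {0, ..., k-1} here."""
    return -sum(math.log(softmax_probs(W, x)[l]) for x, l in zip(xs, ls))

def grad(W, xs, ls):
    """Row c holds - sum_i x_i (delta(l_i = c) - Prob(l_i = c | x_i, W))."""
    G = [[0.0] * len(W[0]) for _ in W]
    for x, l in zip(xs, ls):
        p = softmax_probs(W, x)
        for c in range(len(W)):
            coeff = (1.0 if l == c else 0.0) - p[c]
            for j, xj in enumerate(x):
                G[c][j] -= coeff * xj
    return G

xs = [(1.0, 2.0), (1.0, -1.0), (1.0, 0.5), (1.0, 3.0)]
ls = [0, 1, 2, 0]
W = [[0.1, -0.2], [0.0, 0.3], [-0.1, 0.05]]

G = grad(W, xs, ls)
h = 1e-6
W[1][0] += h
up = cost(W, xs, ls)
W[1][0] -= 2 * h
down = cost(W, xs, ls)
W[1][0] += h                               # restore the original weight
print((up - down) / (2 * h), G[1][0])
```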


Overfitting is a major concern in deep learning since large networks can have
hundreds of millions of weights. In image recognition, the number of training images can
be significantly increased by random jittering of the images. Another technique called 
dropout randomly deletes a fraction of the weights at each training iteration. Regulariza- 
tion is used to assign a cost to the size of weights and many other ideas are being explored. 


Deep learning is an active research area. We explore a few of the directions here. For 
a given gate one can construct an activation vector where each coordinate corresponds 
to an image and the coordinate value is the gate’s output for the corresponding image. 
Alternatively one could define an image activation vector whose coordinates correspond
to gate values for the image. Basically these activation vectors correspond to rows and
columns of a matrix whose $ij^{th}$ element is the output of gate i for image j.


A gate activation vector indicates which images an individual gate responds to. The
coordinates can be permuted so that the activation increases. This then gives the set of
images with high activation. To determine whether two gates learn the same feature one
computes the covariance of the two gate activation vectors.

$$\text{covariance}(a_i, a_j) = \frac{E\big[(a_i - E(a_i))(a_j - E(a_j))\big]}{\sigma(a_i)\,\sigma(a_j)}$$

A value close to ±1 indicates a strong relationship between what gates i and j learned.
Another interesting question is if a network is trained twice starting from different sets of
random weights, do the gates learn the same features or do they carry out the classification
in totally different ways? To match gates that learn the same features one constructs a
matrix where the $ij^{th}$ entry is the covariance of the gate i activation vector in one training
and gate j activation vector in the other training.
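The normalized covariance of two gate activation vectors, $E[(a_i - E(a_i))(a_j - E(a_j))]$ divided by $\sigma(a_i)\sigma(a_j)$, is straightforward to compute. The sketch below treats each coordinate (image) as equally likely and assumes neither vector is constant; the activation values and the function name are made up for illustration.

```python
import math

def normalized_covariance(a, b):
    """E[(a - E a)(b - E b)] / (sigma(a) sigma(b)) for equal-length vectors.

    Assumes neither vector is constant (otherwise a standard deviation is 0).
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    return cov / (sd_a * sd_b)

gate_i = [0.1, 0.9, 0.4, 0.7]            # outputs of one gate on four images
gate_j = [2 * v + 1 for v in gate_i]     # a gate responding to the same images
gate_k = [0.5, 0.2, 0.9, 0.1]            # a gate with a different response pattern

print(normalized_covariance(gate_i, gate_j))
print(normalized_covariance(gate_i, gate_k))
```

Applying this to every pair of gates from two training runs fills in the matching matrix described above.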


Recreating image from activation vector 

Given an image one can easily get the corresponding gate activation vector. However, 
given a gate activation vector, how does one get the corresponding image? There are
many ways to do this. One way would be to find the gate activation vector for a random 
image and then by gradient descent on the pixels of the image reduce the L2 norm be- 
tween the gate activation vector of the random image and the gate activation vector for 
which you want the image that produced it. 


Style transfer 

Given that one can produce the image that produced a given activation vector, one can 
reproduce an image from its content using the style of a different image. To do this define 
the content of an image to be the gate activation vector corresponding to the first level 
of gates in the network [GEB15]. The reason for selecting the first level of gates is that a
network discards information in subsequent levels that is not relevant for classification. To
define the style of an image, form a matrix with the inner product of the last activation
vector with itself. Now one can create an image using the content of one image and the
style of another. For example, you could create the image of a student using the style of
a much older individual [GKL+15].


Random weights 

An interesting observation is that one can do style transfer without training a network. 
One uses random weights instead. This raises the issue of which tasks require training 
and which tasks require only the structure of the network. 


Structure of activation space 

Understanding the structure of activation space is an important area of research. One
can examine the region of space that corresponds to images of cats, or one can examine
the region of space that gets classified as cat. Every input, even if it is random noise,
will get classified. It appears that every image in one classification is close to an image in
each of the other classifications.


Fooling

One can make small changes to an image that will change the image's classification without 
changing the image enough for a human to recognize that the image had been modified. 
There are many ways to do this. To change the classification of an image of a cat to
automobile, simply test each pixel value and see which way it increases the probability
of the image being classified as an automobile. Create a 0,1,-1 matrix where 1 means
increasing the pixel value increases the likelihood of an automobile classification and -1
means decreasing the pixel value increases the likelihood of an automobile classification.
Then

image + $\alpha$ (0,1,-1 matrix)


will cause the modified image to be classified as an automobile for a small value of $\alpha$.
The reason for this is that the change of each pixel value makes a small change in the
probability of the classification but a large number of small changes can be big. For more
information about deep learning, see [Ben09].
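
The pixel-testing procedure can be sketched on a toy model. The code below is only an illustration under strong assumptions: the "network" is a hypothetical logistic function of a linear score over 100 pixels, and the weights, bias, and $\alpha$ are invented for the example.

```python
import math

def automobile_probability(image, weights, bias):
    """Toy stand-in for a network: logistic function of a linear score."""
    score = sum(w * p for w, p in zip(weights, image)) + bias
    return 1.0 / (1.0 + math.exp(-score))

def sign_matrix(image, weights, bias, delta=1e-4):
    """Test each pixel: +1 if raising it raises the probability, -1 if it lowers it, else 0."""
    base = automobile_probability(image, weights, bias)
    signs = []
    for k in range(len(image)):
        bumped = list(image)
        bumped[k] += delta
        change = automobile_probability(bumped, weights, bias) - base
        signs.append(1 if change > 0 else -1 if change < 0 else 0)
    return signs

# Hypothetical 100-pixel image and classifier parameters.
n = 100
image = [0.5] * n
weights = [0.3 if k % 2 == 0 else -0.3 for k in range(n)]
bias = -2.0

alpha = 0.1
signs = sign_matrix(image, weights, bias)
fooled = [p + alpha * s for p, s in zip(image, signs)]  # image + alpha * (0,1,-1 matrix)

p_before = automobile_probability(image, weights, bias)
p_after = automobile_probability(fooled, weights, bias)
```

No pixel moves by more than `alpha`, yet the 100 small changes all push the score in the same direction, so the probability crosses one half.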


5.8.1 Generative Adversarial Networks (GANs) 


Image generation has become an important area where one enters a word or phrase 
into an image generation program which produces the desired image. One might ask why 
not search the web instead? If one wanted an image of a cat, searching the web would 
be fine. However, if one wants a more complex image of a cat watching someone fishing 
while the sun was setting, it might not be possible to find such an image. 


A method that is promising in trying to generate images that look real is to create
code that tries to discern between real images and synthetic images.
One first trains the synthetic image discriminator to distinguish between real images and
synthetic ones. Then one trains the image generator to generate images that the
discriminator believes are real images. Alternating the training between the two units ends up
forcing the image generator to produce real looking images. This is the idea of Generative
Adversarial Networks.





See also the tutorials: http://deeplearning.net/tutorial/deeplearning.pdf and 
http://deeplearning.stanford.edu/tutorial/. 

There are many possible applications for this technique. Suppose you wanted to train
a network to translate from English to German. First train a discriminator to determine
if a sentence is a real sentence as opposed to a synthetic sentence. Then train a translator
for English to German which translates the English words to German words in such a way
that the output will convince the discriminator that it is a German sentence. Next take
the German sentence and translate it to English words in such a way that the output will
convince the discriminator that it is an English sentence. Finally train the entire system
so the generated English sentence agrees with the original English sentence. At this point
the German sentence is likely to be a good translation.


5.9 Gradient Descent


The gradient of a function $f(x)$ of $d$ variables, $x = (x_1, x_2, \ldots, x_d)$, at a point $x_0$ is
denoted $\nabla f(x_0)$. It is a $d$-dimensional vector with components
$\left(\frac{\partial f}{\partial x_1}(x_0), \frac{\partial f}{\partial x_2}(x_0), \ldots, \frac{\partial f}{\partial x_d}(x_0)\right)$,
where $\frac{\partial f}{\partial x_i}$ are partial derivatives. Without explicitly stating, we assume that the
derivatives referred to exist. The rate of increase of the function $f$ as we move from $x_0$ in a
direction $u$ is $\nabla f(x_0) \cdot u$. So the direction of steepest descent is $-\nabla f(x_0)$. This is a
natural direction to move to minimize $f$.


By how much should we move? This is primarily an experimental area and how much we
should move depends on the specific problem. In general one starts with a large move
and when a point is reached where the function no longer decreases since one is possibly 
overshooting the minimum, the move size is reduced. Another option is to use a quadratic 
approximation and use both the first and second derivatives to determine how far to move. 


There are significantly improved versions of gradient descent involving momentum and
various versions of it such as Nesterov's Accelerated Gradient. If the minimum is in a
long narrow ravine, plain gradient descent may oscillate across the ravine, slowing
convergence, rather than travel along the ravine. Momentum is one method of forcing the
move along the ravine. The formula is

$$v = \gamma v + \alpha \nabla f$$
$$x = x - v$$

Here $v$ is a velocity vector, $\alpha$ is the learning rate, and $\gamma$ is a constant that is usually set
to 0.5 until the initial learning stabilizes and then increased to 0.9. The velocity vector
averages out the oscillation across the ravine and helps steer the descent in high
dimensional problems. However, one may not wish to move down the bottom of the narrow
ravine if the slope is shallow. Although one can improve the training error slightly, one
really wants to improve the true error and avoid overfitting by a regularizer that reduces
some measure of complexity.
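
A minimal sketch of the momentum update on a ravine-shaped function follows. The test function $f(x) = \frac{1}{2}(x_1^2 + 100x_2^2)$, the starting point, and the values $\alpha = 0.015$, $\gamma = 0.9$ are assumptions chosen for illustration; setting $\gamma = 0$ recovers plain gradient descent.

```python
def f(x):
    """Ravine-shaped test function: shallow along x[0], steep along x[1]."""
    return 0.5 * (x[0] ** 2 + 100.0 * x[1] ** 2)

def grad(x):
    """Gradient of f."""
    return [x[0], 100.0 * x[1]]

def descend(x, alpha, gamma, steps):
    """Iterate v = gamma*v + alpha*grad f(x), then x = x - v."""
    v = [0.0, 0.0]
    for _ in range(steps):
        g = grad(x)
        v = [gamma * vi + alpha * gi for vi, gi in zip(v, g)]
        x = [xi - vi for xi, vi in zip(x, v)]
    return x

start = [10.0, 1.0]
plain = descend(start, alpha=0.015, gamma=0.0, steps=100)
with_momentum = descend(start, alpha=0.015, gamma=0.9, steps=100)
```

With these settings plain gradient descent flips sign across the ravine on every step and creeps slowly along it, while the velocity term accumulates progress along the shallow direction, so `f(with_momentum)` ends far below `f(plain)` after the same number of steps.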


The step size and how it is changed is important. For educational purposes we now
focus on infinitesimal gradient descent where the algorithm makes infinitesimal moves in
the $-\nabla f(x_0)$ direction. Whenever $\nabla f$ is not the zero vector, we strictly decrease the
function in the direction $-\nabla f$, so the current point is not a minimum of the function.
Conversely, a point $x$ where $\nabla f = 0$ is called a first-order local optimum of $f$. A first-order
local optimum may be a local minimum, local maximum, or a saddle point. We ignore
saddle points since numerical error is likely to prevent gradient descent from stopping at
a saddle point, at least in low dimensions. In a million dimensions, if the function decreases
in only one dimension and increases in all the others, gradient descent is not likely to find
the decrease and may stop at the saddle point, treating it as a local minimum.


In general, local minima do not have to be global minima, and gradient descent may 
converge to a local minimum that is not a global minimum. When the function f is 
convex, this is not the case. A function $f$ of a single variable $x$ is said to be convex if
for any two points $a$ and $b$, the line segment joining $(a, f(a))$ and $(b, f(b))$ lies on or
above the curve $f(\cdot)$. A function of many variables is convex if on any line segment in
its domain, it acts as a convex function of one variable on the line segment.


Definition 5.4 A function $f$ over a convex domain is a convex function if for any two
points $x$ and $y$ in the domain, and any $\lambda$ in $[0,1]$,

$$f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda) f(y).$$

The function is concave if the inequality is satisfied with $\ge$ instead of $\le$.


Theorem 5.18 Suppose $f$ is a convex, differentiable function defined on a closed bounded
convex domain. Then any first-order local minimum is also a global minimum. Infinitesimal
gradient descent always reaches the global minimum.


Proof: We will prove that if $x$ is a local minimum, then it must be a global minimum.
If not, consider a global minimum point $y \ne x$. On the line joining $x$ and $y$, the function
must not go above the line joining $f(x)$ and $f(y)$. This means for an infinitesimal $\epsilon > 0$,
moving distance $\epsilon$ from $x$ towards $y$, the function must decrease, so $\nabla f(x)$ is not 0,
contradicting the assumption that $x$ is a local minimum. ■


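The second derivatives $\frac{\partial^2 f}{\partial x_i \partial x_j}$ form a matrix, called the Hessian, denoted $H(f(x))$.
The Hessian of $f$ at $x$ is a symmetric $d \times d$ matrix with $ij^{th}$ entry $\frac{\partial^2 f}{\partial x_i \partial x_j}(x)$. The second
derivative of $f$ at $x$ in the direction $u$ is the rate of change of the first derivative in the
direction $u$ from $x$. It is easy to see that it equals

$$u^T H(f(x))\, u.$$

To see this, note that the second derivative of $f$ along the unit vector $u$ is

$$\sum_j u_j \frac{\partial}{\partial x_j}\left(\nabla f(x)^T u\right) = \sum_j u_j \sum_i \frac{\partial}{\partial x_j}\left(u_i \frac{\partial f(x)}{\partial x_i}\right) = \sum_{j,i} u_j u_i \frac{\partial^2 f(x)}{\partial x_j \partial x_i}.$$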


Theorem 5.19 Suppose $f$ is a function from a closed convex domain in $R^d$ to the reals
and the Hessian of f exists everywhere in the domain. Then f is convex (concave) on the 
domain if and only if the Hessian of f is positive (negative) semi-definite everywhere on 
the domain. 


Gradient descent requires the gradient to exist. But, even if the gradient is not always 
defined, one can minimize a convex function over a convex domain efficiently, i.e., in 
polynomial time. Technically, one can only find an approximate minimum with the time 
depending on the error parameter as well as the presentation of the convex set. We do not 
go into these details. But, in principle we can minimize a convex function over a convex 
domain. We can also maximize a concave function over a convex domain. However, in
general, we do not have efficient procedures to maximize a convex function over a convex 
domain. It is easy to see that at a first-order local minimum of a possibly non-convex 
function, the gradient vanishes. But a second-order local decrease of the function may 
be possible. The steepest second-order decrease is in the direction of $\pm v$, where $v$ is the
eigenvector of the Hessian corresponding to the most negative eigenvalue.


5.9.1 Stochastic Gradient Descent 


We now describe a widely-used algorithm in machine learning, called stochastic gradient
descent. Often the function $f(x)$ that we are trying to minimize is actually the sum
of many simple functions. We may have 100,000 images that we are trying to classify and
$f(x) = \sum_i f_i(x)$ where $f_i(x)$ is the error for the $i^{th}$ image. The function $f$ may have a
million weights and gradient descent would take the derivative of the 100,000 terms in $f$
with respect to each of the million weights. A better method might be at each iteration
to randomly select one of the images and take the derivatives with respect to just a single
term in the summation. This should speed the convergence up significantly. In practice,
initially one randomly selects one term each iteration, after a number of iterations 50
terms are selected, then maybe 200, and finally full gradient descent. This usually gets
one to a better minimum than the full gradient descent. So not only is it faster but it
gets a better solution.


Figure 5.7: Error function for a single image

Figure 5.8: Full error function for 1,000 images.


To understand why stochastic gradient descent sometimes gets a better minimum than
gradient descent consider Figures 5.7 and 5.8. Here the true error function is the sum of
1,000 simpler error functions. If in Figure 5.8 one starts gradient descent at 1000, one
probably would get stuck at the local minimum around 750. However, stochastic gradient
descent would alternate in minimizing different simple error functions and would move
into the region 180 to 600 if the batch size was one. If one then switched the batch size
to 50, the variance would decrease and one might range around 450 to 600. Finally with
full gradient descent one would converge to a much better minimum.
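
The batch-size schedule described above (one term, then 50, then 200, then the full sum) can be sketched as follows. The objective here is an assumed toy problem, $f(x) = \frac{1}{n}\sum_i \frac{1}{2}(x - c_i)^2$ with made-up values $c_i$; it is convex, so it shows only the mechanics of the schedule and not the escape from poor local minima pictured in Figure 5.8.

```python
import random

def sgd(centers, x, schedule, alpha, seed=0):
    """Minimize the average of 0.5*(x - c_i)^2 using mini-batches.

    schedule is a list of (batch_size, iterations) pairs; a batch size
    equal to len(centers) is full gradient descent.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    for batch_size, iterations in schedule:
        for _ in range(iterations):
            batch = rng.sample(centers, batch_size)
            gradient = sum(x - c for c in batch) / batch_size
            x -= alpha * gradient
    return x

# Hypothetical data: 1,000 simple error terms centered near 3.
data_rng = random.Random(1)
centers = [data_rng.gauss(3.0, 1.0) for _ in range(1000)]
mean = sum(centers) / len(centers)

x0 = 10.0
schedule = [(1, 200), (50, 100), (200, 100), (1000, 100)]
x_final = sgd(centers, x0, schedule, alpha=0.1)
```

The early single-term steps are noisy but cheap; each later stage reduces the variance, and the final full-gradient stage settles on the minimizer, which for this objective is the mean of the $c_i$.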


We have been talking about a local minimum being good in terms of the training
error. What we also want is good generalization. If the training data is a good sample of
the full data, then intuitively a broad local minimum should be better than a sharp local
minimum since a small shift in the error function for a broad minimum will not result in
a large increase in error whereas a small shift will cause a large increase in error for a
sharp local minimum.


5.9.2 Regularizer 


Often one modifies the error function by adding a regularizer term to achieve some
property. If the amount of data is not sufficient to prevent overfitting, one might, in training
a deep network, add a term that would penalize the number of non-zero weight vectors.
A complex network might have 10% training error with 20% real error whereas a simple
network might have 15% training error with 20% real error. With a regularizer one might
balance the size of the network with training error and get a training error of 12% and
real error 18%.


5.10 Online Learning 


So far we have been considering batch learning. You are given a “batch” of data, a 
training sample S, and your goal is to use the training example to produce a hypothesis 
h that will have low error on new data. We now switch to the more challenging online 
learning scenario where we remove the assumption that data is sampled from a fixed 
probability distribution, or from any probabilistic process at all. 


The online learning scenario proceeds as follows. At each time t = 1,2,..., two events 
occur: 


1. The algorithm is presented with an arbitrary example $x_t \in X$ and is asked to make
a prediction $\hat{\ell}_t$ of its label.


2. Then, the algorithm is told the true label of the example $c^*(x_t)$ and is charged for
a mistake if $c^*(x_t) \ne \hat{\ell}_t$.


The goal of the learning algorithm is to make as few mistakes as possible. For example, 
consider an email classifier that when a new email message arrives must classify it as 
“important” or “it can wait”. The user then looks at the email and informs the algorithm 
if it was incorrect. We might not want to model email messages as independent random
objects from a fixed probability distribution because they often are replies to previous 
emails and build on each other. Thus, the online learning model would be more appro- 
priate than the batch model for this setting. 


Intuitively, the online learning model is harder than the batch model because we have 
removed the requirement that our data consists of independent draws from a fixed proba- 
bility distribution. Indeed, we will see shortly that any algorithm with good performance 
in the online model can be converted to an algorithm with good performance in the batch 
model. Nonetheless, the online model can sometimes be a cleaner model for design and
analysis of algorithms. 


5.10.1 An Example: Learning Disjunctions 


A disjunction is an OR of features. For example, an email may be important if it comes
from one of your instructors, or comes from a family member, or is a reply to an email
you sent earlier. Consider the problem of learning disjunctions over the instance space
$X = \{0,1\}^d$, that is, disjunctions over $d$ Boolean features. There are $2^d$ possible
disjunctions over this instance space, ranging from the disjunction of nothing, which is always
negative, to the disjunction $h(x) = x_1 \vee x_2 \vee \ldots \vee x_d$ that is positive on any example with at
least one feature set to 1. In general, a typical disjunction such as $h(x) = x_1 \vee x_4 \vee x_8$ will
have some relevant variables (in this case, $x_1$, $x_4$, and $x_8$) and some irrelevant variables
(in this case, everything else), and outputs positive on any input that has one or more
relevant variables set to 1.


We can solve the problem of learning disjunctions in the online model by starting
with the disjunction of all of the variables $h(x) = x_1 \vee x_2 \vee \ldots \vee x_d$. Our algorithm will
maintain the invariant that every relevant variable for the target function is present in the
hypothesis $h$, along with perhaps some irrelevant variables. This is certainly true at the
start. Given this invariant, the only mistakes possible are on inputs $x$ for which $h(x)$ is
positive but the true label $c^*(x)$ is negative. When such a mistake occurs, we just remove
from $h$ any variable set to 1 in $x$, since it can’t possibly be a relevant variable for the
target function. This shrinks the size of the hypothesis $h$ by at least 1, and maintains the
invariant. This implies that the algorithm makes at most $d$ mistakes total on any series
of examples consistent with a target disjunction $c^*$. In fact, one can show this bound
is tight by showing that no deterministic algorithm can guarantee to make fewer than $d$
mistakes.
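
The elimination procedure just described can be sketched in a few lines. The function name, the zero-based variable indices, and the example target are assumptions made for the illustration.

```python
import itertools

def learn_disjunction(examples, d):
    """Online learner for disjunctions over d Boolean features.

    examples is a sequence of (x, label) pairs, x a tuple of 0/1 values and
    label True for positive. Returns the final hypothesis (a set of variable
    indices) and the number of mistakes made.
    """
    hypothesis = set(range(d))  # start with the OR of all d variables
    mistakes = 0
    for x, label in examples:
        prediction = any(x[i] == 1 for i in hypothesis)
        if prediction != label:
            mistakes += 1
            # By the invariant the mistake is a false positive, so no
            # variable set to 1 in x can be relevant: remove them all.
            hypothesis -= {i for i in range(d) if x[i] == 1}
    return hypothesis, mistakes

# Hypothetical target: the first OR the fourth variable, over d = 5 features.
d = 5
target = {0, 3}
examples = [(x, any(x[i] == 1 for i in target))
            for x in itertools.product([0, 1], repeat=d)]
hypothesis, mistakes = learn_disjunction(examples, d)
```

Because the sequence contains every point of $\{0,1\}^5$, each irrelevant variable is eventually eliminated and the hypothesis ends up equal to the target, after at most $d$ mistakes.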


Theorem 5.20 For any deterministic online learning algorithm $A$, there exists a
sequence $\sigma$ of examples over $\{0,1\}^d$ and a target disjunction $c^*$ such that $A$ makes at least
$d$ mistakes on the sequence of examples $\sigma$ labeled by $c^*$.


Proof: Let $\sigma$ be the sequence $e_1, e_2, \ldots, e_d$ where $e_i$ is the example that sets every variable
to zero except $x_i = 1$. Imagine running $A$ on sequence $\sigma$ and telling $A$ it made a mistake
on every example; that is, if $A$ predicts positive on $e_i$ we set $c^*(e_i) = -1$ and if $A$ predicts
negative on $e_i$ we set $c^*(e_i) = +1$. This target corresponds to the disjunction of all $x_i$
such that $A$ predicted negative on $e_i$, so it is a legal disjunction. Since $A$ is deterministic,
the fact that we constructed $c^*$ by running $A$ is not a problem: it would make the same
mistakes if re-run from scratch on the same sequence and same target. Therefore, $A$
makes $d$ mistakes on this $\sigma$ and $c^*$. ■


5.10.2 The Halving Algorithm


If we are not concerned with running time, a simple online algorithm that guarantees
to make at most $\log_2(|H|)$ mistakes for a target belonging to any given class $H$ is called the
halving algorithm. This algorithm simply maintains the version space $V \subseteq H$ consisting
of all $h \in H$ consistent with the labels on every example seen so far, and predicts based
on majority vote over these functions. Each mistake is guaranteed to reduce the size of
the version space $V$ by at least half (hence the name), thus the total number of mistakes
is at most $\log_2(|H|)$. Note that this can be viewed as the number of bits needed to write
a function in $H$ down.
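
A small sketch of the halving algorithm follows, using the class of all disjunctions over four Boolean variables as an assumed example of $H$ (so $|H| = 16$). The function name and the tie-breaking rule (predict positive on a tie) are choices made here for the illustration.

```python
import itertools
import math

def halving(hypotheses, examples):
    """Predict by majority vote over the version space; return the mistake count."""
    version_space = list(hypotheses)
    mistakes = 0
    for x, label in examples:
        votes = sum(1 for h in version_space if h(x))
        prediction = 2 * votes >= len(version_space)  # ties predict positive
        if prediction != label:
            mistakes += 1
        # Keep only the hypotheses consistent with the revealed label.
        version_space = [h for h in version_space if h(x) == label]
    return mistakes

# Hypothetical class H: every disjunction over d = 4 variables.
d = 4
subsets = [s for k in range(d + 1) for s in itertools.combinations(range(d), k)]
hypotheses = [lambda x, s=s: any(x[i] == 1 for i in s) for s in subsets]

target = (1, 2)
examples = [(x, any(x[i] == 1 for i in target))
            for x in itertools.product([0, 1], repeat=d)]
mistakes = halving(hypotheses, examples)
```

Whenever the majority is wrong, at least half of the version space is discarded, so the number of mistakes cannot exceed $\log_2 16 = 4$ here. The price is the running time: the version space is stored explicitly.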


5.10.3 The Perceptron Algorithm 


Earlier we described the Perceptron Algorithm as a method for finding a linear sep- 
arator consistent with a given training set S. However, the Perceptron Algorithm also 
operates naturally in the online setting. 


Recall that the basic assumption of the Perceptron Algorithm is that the target
function can be described by a vector $w^*$ such that for each positive example $x$ we have
$x^T w^* \ge 1$ and for each negative example $x$ we have $x^T w^* \le -1$. Recall also that we
can interpret $x^T w^*/|w^*|$ as the distance of $x$ to the hyperplane $x^T w^* = 0$. Thus, our
assumption states that there exists a linear separator through the origin with all positive
examples on one side, all negative examples on the other side, and all examples at
distance at least $\gamma = 1/|w^*|$ from the separator, where $\gamma$ is called the margin of separation.


The guarantee of the Perceptron Algorithm will be that the total number of mistakes
is at most $(r/\gamma)^2$ where $r = \max_t |x_t|$ over all examples $x_t$ seen so far. Thus, if there exists
a hyperplane through the origin that correctly separates the positive examples from the 
negative examples by a large margin relative to the radius of the smallest ball enclosing 
the data, then the total number of mistakes will be small. The algorithm, restated in the 
online setting, is as follows. 


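The Perceptron Algorithm: Start with the all-zeroes weight vector $w = 0$. Then, for
$t = 1, 2, \ldots$ do:


1. Given example $x_t$, predict $\text{sgn}(x_t^T w)$.
2. If the prediction was a mistake, then update $w \leftarrow w + x_t \ell_t$, where $\ell_t \in \{-1, +1\}$ is the true label of $x_t$.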
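
The two steps above translate almost directly into code. In the sketch below the data set, the choice $w^* = (1, -1)$, and the convention of predicting $-1$ when $x_t^T w = 0$ are assumptions made for the example.

```python
def perceptron(examples):
    """Online Perceptron. examples yields (x, label) pairs with label +1 or -1."""
    w = None
    mistakes = 0
    for x, label in examples:
        if w is None:
            w = [0.0] * len(x)  # all-zeroes weight vector
        score = sum(wi * xi for wi, xi in zip(w, x))
        prediction = 1 if score > 0 else -1  # assumption: sgn(0) is taken as -1
        if prediction != label:
            mistakes += 1
            w = [wi + label * xi for wi, xi in zip(w, x)]  # w <- w + x_t * l_t
    return w, mistakes

# Hypothetical data separated by w* = (1, -1): label is the sign of a - b,
# and |a - b| >= 1 on every point, so the margin assumption holds.
points = [(a, b) for a in range(-3, 4) for b in range(-3, 4) if a != b]
examples = [((float(a), float(b)), 1 if a > b else -1) for a, b in points]

w, mistakes = perceptron(examples * 40)  # present the data 40 times in a row

r_squared = max(x[0] ** 2 + x[1] ** 2 for x, _ in examples)
bound = r_squared * 2.0  # r^2 |w*|^2 with |w*|^2 = 2
```

Since the total number of mistakes is bounded by `bound` (36 for this data), at most 36 of the 40 passes can contain a mistake, so the final `w` classifies every point correctly.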


The Perceptron Algorithm enjoys the following guarantee on its total number of mis- 
takes. 


Theorem 5.21 On any sequence of examples $x_1, x_2, \ldots$, if there exists a vector $w^*$ such
that $x_t^T w^* \ell_t \ge 1$ for all $t$, i.e., a linear separator of margin $\gamma = 1/|w^*|$, then the Perceptron
Algorithm makes at most $r^2|w^*|^2$ mistakes, where $r = \max_t |x_t|$.


Proof: Fix some consistent $w^*$. Keep track of two quantities, $w^T w^*$ and $|w|^2$. Each
mistake increases $w^T w^*$ by at least 1 since

$$(w + x_t \ell_t)^T w^* = w^T w^* + x_t^T \ell_t w^* \ge w^T w^* + 1.$$

Next, each mistake increases $|w|^2$ by at most $r^2$. For each mistake

$$(w + x_t \ell_t)^T (w + x_t \ell_t) = |w|^2 + 2 x_t^T \ell_t w + |x_t|^2 \le |w|^2 + |x_t|^2 \le |w|^2 + r^2,$$

where the middle inequality comes from the fact that $x_t^T \ell_t w \le 0$. Note that it is
important here that we only update on a mistake.

So, if we make $m$ mistakes, then $w^T w^* \ge m$, and $|w|^2 \le m r^2$, or equivalently,
$|w| \le r\sqrt{m}$. Finally, we use the fact that $w^T w^*/|w^*| \le |w|$ which is just saying that the
projection of $w$ in the direction of $w^*$ cannot be larger than the length of $w$. This gives
us:

$$m \le w^T w^*$$
$$m/|w^*| \le |w|$$
$$m/|w^*| \le r\sqrt{m}$$
$$\sqrt{m} \le r|w^*|$$
$$m \le r^2|w^*|^2$$

as desired. ■


5.10.4 Inseparable Data and Hinge Loss 


We assumed above that there exists a perfect w* that correctly classifies all the ex- 
amples, e.g., correctly classifies all the emails into important versus non-important. This 
is rarely the case in real-life data. What if even the best w* isn’t perfect? We can see 
what this does to the above proof (Theorem 5.21). If there is an example that w* doesn’t 
correctly classify, then while the second part of the proof still holds, the first part (the dot 
product of w with w* increasing) breaks down. However, if this doesn’t happen too often,
and also $x_t^T w^*$ is just a “little bit wrong” then we will only make a few more mistakes.


Define the hinge-loss of $w^*$ for a positive example $x_t$ as the amount $x_t^T w^*$ is less than
one and for a negative example $x_t$ as the amount $x_t^T w^*$ is greater than minus one. That
is, $L_{hinge}(w^*, x_t) = \max(0, 1 - x_t^T w^* \ell_t)$. The total hinge-loss, $L_{hinge}(w^*, S)$, for a set of
examples $S$ is the sum of the hinge-loss of each example in $S$.
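
To make the definition concrete, the short sketch below evaluates the hinge-loss on four made-up examples for an assumed $w^* = (1, -1)$; the function names are chosen for the illustration.

```python
def hinge_loss(w_star, x, label):
    """max(0, 1 - x^T w* l) for one example with label l in {+1, -1}."""
    margin = label * sum(wi * xi for wi, xi in zip(w_star, x))
    return max(0.0, 1.0 - margin)

def total_hinge_loss(w_star, examples):
    """Sum of the hinge-loss over a set of examples."""
    return sum(hinge_loss(w_star, x, label) for x, label in examples)

# Hypothetical separator and examples.
w_star = [1.0, -1.0]
examples = [
    ([2.0, 0.0], +1),   # x^T w* = 2: correct with room to spare, loss 0
    ([0.5, 0.0], +1),   # x^T w* = 0.5: correct side but inside the margin, loss 0.5
    ([1.0, 0.0], -1),   # x^T w* = 1: wrong side, loss 2
    ([0.0, 3.0], -1),   # x^T w* = -3: correct, loss 0
]
losses = [hinge_loss(w_star, x, label) for x, label in examples]
total = total_hinge_loss(w_star, examples)
```

The second example shows that a correctly classified point still incurs loss when it falls inside the margin, and the third that the loss grows linearly with how far a point is on the wrong side.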


Theorem 5.22 On any sequence of examples $S = x_1, x_2, \ldots$, the Perceptron Algorithm
makes at most

$$\min_{w^*} \left( r^2|w^*|^2 + 2L_{hinge}(w^*, S) \right)$$

mistakes, where $r = \max_t |x_t|$.


Proof: As before, each update of the Perceptron Algorithm increases $|w|^2$ by at most
$r^2$, so if the algorithm makes $m$ mistakes, we have $|w|^2 \le m r^2$. What we can no longer
say is that each update of the algorithm increases $w^T w^*$ by at least one. Instead, on
a positive example we are “increasing” $w^T w^*$ by $x_t^T w^*$ (it could be negative), which is
at least $1 - L_{hinge}(w^*, x_t)$. Similarly, on a negative example we “increase” $w^T w^*$ by
$-x_t^T w^*$, which is also at least $1 - L_{hinge}(w^*, x_t)$. Summing this up over all mistakes yields
$w^T w^* \ge m - L_{hinge}(w^*, S)$, where we are using the fact that hinge-loss is never negative
so summing over all of $S$ is only larger than summing over the mistakes that $w$ made.


Let $L = L_{hinge}(w^*, S)$. If $m \le L$ the bound holds trivially, so assume $m > L$. Then

$$w^T w^* \le |w||w^*|$$
$$(m - L)^2 \le |w|^2 |w^*|^2$$
$$m^2 - 2mL + L^2 \le m r^2|w^*|^2$$
$$m - 2L + L^2/m \le r^2|w^*|^2$$
$$m \le r^2|w^*|^2 + 2L - L^2/m$$
$$m \le r^2|w^*|^2 + 2L$$

as desired. ■


5.10.5 Online to Batch Conversion 


Suppose we have an online algorithm with a good mistake bound, such as the Percep- 
tron Algorithm. Can we use it to get a guarantee in the distributional (batch) learning 
setting? Intuitively, the answer should be yes since the online setting is only harder. 
Indeed, this intuition is correct. We present here two natural approaches for such online 
to batch conversion. 


Conversion procedure 1: Random Stopping. 

Suppose we have an online algorithm $A$ with mistake-bound $m$. Run the algorithm
in a single pass on a sample $S$ of size $m/\epsilon$. Let $i_t$ be the indicator random variable for
the event that $A$ makes a mistake on example $x_t$. Since $\sum_{t=1}^{|S|} i_t \le m$ for any set $S$,
$E\left[\sum_{t=1}^{|S|} i_t\right] \le m$ where the expectation is taken over the random draw of $S$ from $D^{|S|}$. By
linearity of expectation, and dividing both sides by $|S|$,

$$\frac{1}{|S|} \sum_{t=1}^{|S|} E[i_t] \le m/|S| = \epsilon. \qquad (5.2)$$

Let $h_t$ denote the hypothesis used by algorithm $A$ to predict on the $t^{th}$ example. Since
the $t^{th}$ example was randomly drawn from $D$, $E[err_D(h_t)] = E[i_t]$. This means that if we
choose $t$ at random from 1 to $|S|$, that is, stopping the algorithm at a random time, the
expected error of the resulting prediction rule, taken over the randomness in the draw of
$S$ and the choice of $t$, is at most $\epsilon$ as given by equation (5.2). Thus we have the following
theorem.


Theorem 5.23 (Online to Batch via Random Stopping) If an online algorithm $A$
with mistake-bound $m$ is run on a sample $S$ of size $m/\epsilon$ and stopped at a random time
between 1 and $|S|$, the expected error of the hypothesis $h$ produced satisfies $E[err_D(h)] \le \epsilon$.


Conversion procedure 2: Controlled Testing. 

A second approach to using an online learning algorithm A for learning in the distri- 
butional setting is as follows. At each step of algorithm A, we test its current hypothesis 
$h$ on a large enough sample to determine if with high probability, the true error of $h$ is
less than $\epsilon$. If $h$ correctly classifies all the examples in the set, then we stop and output
the hypothesis $h$. Otherwise, we select an example $x$ which was misclassified by $h$ and
submit it to the online algorithm $A$ and get a new hypothesis. We repeat the process
until we find an $h$ that correctly classifies a large enough sample to ensure that, with
probability greater than or equal to $1 - \delta$, $h$'s true error will be less than $\epsilon$. The technical
details follow.


Specifically, suppose that the initial hypothesis produced by algorithm $A$ is $h_1$. Define
$\delta_i = \delta/(i+2)^2$ so $\sum_{i=0}^{\infty} \delta_i = \left(\frac{\pi^2}{6} - 1\right)\delta \le \delta$. Draw a set of $n_1 = \frac{1}{\epsilon}\log\left(\frac{1}{\delta_1}\right)$ random examples
and test to see whether $h_1$ gets all of them correct. Note that if $err_D(h_1) \ge \epsilon$, then the
chance $h_1$ would get them all correct is at most $(1-\epsilon)^{n_1} \le \delta_1$. So, if $h_1$ indeed gets them
all correct, we output $h_1$ as our hypothesis and halt. If not, we choose some example
$x_1$ in the sample on which $h_1$ made a mistake and give it to algorithm $A$. Algorithm $A$
then produces some new hypothesis $h_2$ and we again repeat, testing $h_2$ on a fresh set of
$n_2 = \frac{1}{\epsilon}\log\left(\frac{1}{\delta_2}\right)$ random examples, and so on.


In general, given $h_t$ we draw a fresh set of $n_t = \frac{1}{\epsilon}\log\left(\frac{1}{\delta_t}\right)$ random examples and test
to see whether $h_t$ gets all of them correct. If so, we output $h_t$ and halt; if not, we choose
some $x_t$ on which $h_t(x_t)$ was incorrect and give it to algorithm $A$. By choice of $n_t$, if $h_t$
had error rate $\epsilon$ or larger, the chance we would mistakenly output it is at most $\delta_t$. By
choice of the values $\delta_t$, the chance we ever halt with a hypothesis of error $\epsilon$ or larger is at
most $\delta_1 + \delta_2 + \ldots \le \delta$. Thus, we have the following theorem.


Theorem 5.24 (Online to Batch via Controlled Testing) Let $A$ be an online learning
algorithm with mistake-bound $m$. Then this procedure will halt after $O\left(\frac{m}{\epsilon}\log\left(\frac{m}{\delta}\right)\right)$
examples and with probability at least $1 - \delta$ will produce a hypothesis of error at most $\epsilon$.


Note that in this conversion we cannot re-use our samples. Since the hypothesis $h_t$ depends
on the previous data, we need to draw a fresh set of $n_t$ examples to use for testing it.
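
The sample sizes used by controlled testing are easy to tabulate. The sketch below assumes natural logarithms and rounds each $n_t$ up to an integer; the function names and the values $\epsilon = 0.1$, $\delta = 0.05$ are chosen only for illustration.

```python
import math

def sample_size(t, epsilon, delta):
    """n_t = (1/epsilon) * ln(1/delta_t) with delta_t = delta/(t+2)^2, rounded up."""
    delta_t = delta / (t + 2) ** 2
    return math.ceil(math.log(1.0 / delta_t) / epsilon)

def total_examples(m, epsilon, delta):
    """Examples drawn if the online algorithm uses all m mistakes (hypotheses h_1..h_{m+1})."""
    return sum(sample_size(t, epsilon, delta) for t in range(1, m + 2))

epsilon, delta = 0.1, 0.05
sizes = [sample_size(t, epsilon, delta) for t in range(1, 21)]
total_for_three_mistakes = total_examples(3, epsilon, delta)
```

With these values the first test set has 52 examples, and the sizes grow only logarithmically in $t$, which is where the $\log(m/\delta)$ factor in Theorem 5.24 comes from.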


5.10.6 Combining (Sleeping) Expert Advice 


Imagine you have access to a large collection of rules-of-thumb that specify what to 
predict in different situations. For example, in classifying news articles, you might have a 
rule that says “if the article has the word ‘football’, then classify it as sports” and another 
that says “if the article contains a dollar figure, then classify it as business”. In predicting 
the stock market, there could be different economic indicators. These predictors might 
at times contradict each other, e.g., a news article that has both the word “football” and 
a dollar figure, or a day in which two economic indicators are pointing in different
directions. It also may be that no predictor is perfectly accurate with some predictors much
better than others. We present here an algorithm for combining a large number of such 
predictors with the guarantee that if any of them are good, the algorithm will perform 
nearly as well as each good predictor on the examples on which that predictor fires. 


Formally, define a “sleeping expert” to be a predictor h that on any given example x 
either makes a prediction on its label or chooses to stay silent (asleep). Now, suppose we 
have access to $n$ such sleeping experts $h_1, \ldots, h_n$, and let $S_i$ denote the subset of examples
on which $h_i$ makes a prediction (e.g., this could be articles with the word “football” in
them). We consider the online learning model, and let $mistakes(A, S)$ denote the number
of mistakes of an algorithm $A$ on a sequence of examples $S$. Then the guarantee of our
algorithm $A$ will be that for all $i$

$$E(mistakes(A, S_i)) \le (1+\epsilon)\, mistakes(h_i, S_i) + O\left(\frac{\log n}{\epsilon}\right)$$

where $\epsilon$ is a parameter of the algorithm and the expectation is over internal randomness
in the randomized algorithm $A$.


As a special case, if $h_1, \ldots, h_n$ are concepts from a concept class $H$, so they all make
predictions on every example, then A performs nearly as well as the best concept in H. 
This can be viewed as a noise-tolerant version of the Halving Algorithm of Section 5.10.2 
for the case that no concept in H is perfect. The case of predictors that make predictions 
on every example is called the problem of combining expert advice, and the more general 
case of predictors that sometimes fire and sometimes are silent is called the sleeping experts 
problem. 


Combining Sleeping Experts Algorithm: 


Initialize each expert $h_i$ with a weight $w_i = 1$. Let $\epsilon \in (0, 1)$. For each example $x$, do the
following:

1. [Make prediction] Let $H_x$ denote the set of experts $h_i$ that make a prediction on $x$, and
let $w_x = \sum_{h_j \in H_x} w_j$. Choose $h_i \in H_x$ with probability $p_{ix} = w_i/w_x$ and predict $h_i(x)$.


2. [Receive feedback] Given the correct label, for each $h_i \in H_x$ let $m_{ix} = 1$ if $h_i(x)$ was
incorrect, else let $m_{ix} = 0$.


3. [Update weights] For each $h_i \in H_x$, update its weight as follows:


• Let $r_{ix} = \left(\sum_{h_j \in H_x} p_{jx} m_{jx}\right)/(1+\epsilon) - m_{ix}$.
• Update $w_i \leftarrow w_i (1+\epsilon)^{r_{ix}}$.

Note that $\sum_{h_j \in H_x} p_{jx} m_{jx}$ represents the algorithm's probability of making a
mistake on example $x$. So, $h_i$ is rewarded for predicting correctly ($m_{ix} = 0$) when the
algorithm had a high probability of making a mistake, and $h_i$ is penalized for
predicting incorrectly ($m_{ix} = 1$) when the algorithm had a low probability of making
a mistake.


For each $h_i \notin H_x$, leave $w_i$ alone.
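
The three steps can be sketched as follows. The representation of a sleeping expert as a function returning `None` when asleep, the rule of skipping examples on which no expert fires, and the three toy experts are assumptions made for the illustration.

```python
import random

def sleeping_experts(experts, examples, epsilon, seed=0):
    """experts: functions returning True/False, or None when asleep.
    examples: (x, label) pairs. Returns (mistakes, final weights)."""
    rng = random.Random(seed)
    weights = [1.0] * len(experts)
    mistakes = 0
    for x, label in examples:
        predictions = [h(x) for h in experts]
        awake = [i for i, p in enumerate(predictions) if p is not None]
        if not awake:
            continue  # assumption: skip examples on which no expert fires
        # 1. Make prediction: choose an awake expert with probability w_i / w_x.
        w_x = sum(weights[i] for i in awake)
        probs = {i: weights[i] / w_x for i in awake}
        chosen = rng.choices(awake, weights=[probs[i] for i in awake])[0]
        if predictions[chosen] != label:
            mistakes += 1
        # 2. Receive feedback: m_ix = 1 if expert i was wrong on x.
        wrong = {i: 1 if predictions[i] != label else 0 for i in awake}
        # 3. Update the weights of the awake experts only.
        p_mistake = sum(probs[i] * wrong[i] for i in awake)
        for i in awake:
            r = p_mistake / (1 + epsilon) - wrong[i]
            weights[i] *= (1 + epsilon) ** r
    return mistakes, weights

# Hypothetical task: label x as True when x is even.
experts = [
    lambda x: x % 2 == 0,                    # always awake, always right
    lambda x: True if x % 3 == 0 else None,  # fires only on multiples of 3
    lambda x: False,                         # always awake, right on odd x only
]
examples = [(x, x % 2 == 0) for x in range(200)]
mistakes, weights = sleeping_experts(experts, examples, epsilon=0.5)
```

The total weight never exceeds its initial value $n = 3$, which is the key fact used in the proof of Theorem 5.25, and the always-correct expert ends with the largest weight.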


Theorem 5.25 For any set of $n$ sleeping experts $h_1, \ldots, h_n$, and for any sequence of
examples $S$, the Combining Sleeping Experts Algorithm $A$ satisfies for all $i$

$$E(mistakes(A, S_i)) \le (1+\epsilon)\, mistakes(h_i, S_i) + O\left(\frac{\log n}{\epsilon}\right)$$

where $S_i = \{x \in S \mid h_i \in H_x\}$.


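Proof: Consider sleeping expert $h_i$. The weight of $h_i$ after the sequence of examples $S$
is exactly

$$w_i = (1+\epsilon)^{\sum_{x \in S_i}\left[\left(\sum_{h_j \in H_x} p_{jx} m_{jx}\right)/(1+\epsilon) - m_{ix}\right]} = (1+\epsilon)^{E[mistakes(A, S_i)]/(1+\epsilon) - mistakes(h_i, S_i)}.$$

Let $w = \sum_j w_j$. Clearly $w_i \le w$. Therefore, taking logs:

$$E(mistakes(A, S_i))/(1+\epsilon) - mistakes(h_i, S_i) \le \log_{1+\epsilon} w.$$

So, using the fact that $\log_{1+\epsilon} w = O\left(\frac{\log w}{\epsilon}\right)$,

$$E(mistakes(A, S_i)) \le (1+\epsilon)\, mistakes(h_i, S_i) + O\left(\frac{\log w}{\epsilon}\right).$$

Initially, $w = n$. To prove the theorem, it is sufficient to prove that $w$ never increases. To
do so, we need to show that for each $x$, $\sum_{h_i \in H_x} w_i (1+\epsilon)^{r_{ix}} \le \sum_{h_i \in H_x} w_i$, or equivalently
dividing both sides by $\sum_{h_j \in H_x} w_j$ that $\sum_i p_{ix}(1+\epsilon)^{r_{ix}} \le 1$, where for convenience we
define $p_{ix} = 0$ for $h_i \notin H_x$.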


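For this we will use the inequalities that for $\beta, z \in [0,1]$, $\beta^z \le 1 - (1-\beta)z$ and
$\beta^{-z} \le 1 + (1-\beta)z/\beta$. Specifically, we will use $\beta = (1+\epsilon)^{-1}$. We now have:

$$\sum_i p_{ix}(1+\epsilon)^{r_{ix}} = \sum_i p_{ix}\, \beta^{\,m_{ix} - \left(\sum_j p_{jx} m_{jx}\right)\beta}$$
$$\le \sum_i p_{ix} \Big(1 - (1-\beta) m_{ix}\Big)\Big(1 + (1-\beta) \sum_j p_{jx} m_{jx}\Big)$$
$$\le \Big(\sum_i p_{ix}\Big) - (1-\beta) \sum_i p_{ix} m_{ix} + (1-\beta) \sum_i p_{ix} \sum_j p_{jx} m_{jx}$$
$$= 1 - (1-\beta) \sum_i p_{ix} m_{ix} + (1-\beta) \sum_j p_{jx} m_{jx}$$
$$= 1,$$

where the second-to-last line follows from using $\sum_i p_{ix} = 1$ in two places. So $w$ never
increases and the bound follows as desired. ■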


5.11 Boosting 


We now describe boosting, which is important both as a theoretical result and as a
practical and easy-to-use learning method. A strong learner for a problem is an algorithm
that with high probability is able to achieve any desired error rate $\epsilon$ using a
number of samples that may depend polynomially on $1/\epsilon$. A weak learner for a problem
is an algorithm that does just a little bit better than random guessing. It is only required
to get with high probability an error rate less than or equal to $\frac{1}{2} - \gamma$ for some $0 < \gamma \le \frac{1}{2}$.
We show here that a weak learner for a problem that achieves the weak-learning guarantee
for any distribution of data can be boosted to a strong learner, using the technique
of boosting. At a high level, the idea will be to take our training sample S, and run
the weak learner on different data distributions produced by weighting the points in the
training sample in different ways. Running the weak learner on these different weightings
of the training sample will produce a series of hypotheses $h_1, h_2, \ldots$. The idea of the
reweighting procedure will be to focus attention on the parts of the sample that previous
hypotheses have performed poorly on. At the end the hypotheses are combined together
by a majority vote.


Assume the weak learning algorithm A outputs hypotheses from some class H. Our
boosting algorithm will produce hypotheses that will be majority votes over $t_0$ hypotheses
from H, for $t_0$ defined below. By Theorem 5.13, the class of functions that can be produced
by the booster running for $t_0$ rounds has VC-dimension $O(t_0\,\text{VCdim}(H) \log(t_0\,\text{VCdim}(H)))$.
This gives a bound on the number of samples needed, via Corollary 5.17, to ensure that
high accuracy on the sample will translate to high accuracy on new data.


To make the discussion simpler, assume that the weak learning algorithm A, when
presented with a weighting of the points in our training sample, always (rather than
with high probability) produces a hypothesis that performs slightly better than random
guessing with respect to the distribution induced by weighting. Specifically:


Definition 5.5 ($\gamma$-Weak learner on sample) A $\gamma$-weak learner is an algorithm that
given examples, their labels, and a non-negative real weight $w_i$ on each example $x_i$,
produces a classifier that correctly labels a subset of the examples with total weight at least
$\left(\frac{1}{2} + \gamma\right)\sum_{i=1}^{n} w_i$.


At a high level, boosting makes use of the intuitive notion that if an example was
misclassified, one needs to pay more attention to it. More specifically, boosting multiplies
the weight of the misclassified examples by a value $\alpha > 1$ designed to raise their total
weight to equal the total weight of the correctly-classified examples. The boosting procedure
is in Figure 5.9.


Boosting Algorithm

Given a sample S of n labeled examples $x_1, \ldots, x_n$, initialize each
example $x_i$ to have a weight $w_i = 1$. Let $w = (w_1, \ldots, w_n)$.

For $t = 1, 2, \ldots, t_0$ do

    Call the weak learner on the weighted sample (S, w), receiving
    hypothesis $h_t$.

    Multiply the weight of each example that was misclassified by
    $h_t$ by $\alpha = \frac{1/2 + \gamma}{1/2 - \gamma}$. Leave the other weights as they are.

End

Output the classifier $\text{MAJ}(h_1, \ldots, h_{t_0})$ which takes the majority vote
of the hypotheses returned by the weak learner. Assume $t_0$ is odd so
there is no tie.


Figure 5.9: The boosting algorithm
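
The procedure in Figure 5.9 can be sketched in Python as follows. The weak learner used here, `weak_learner_stump`, is a hypothetical stand-in chosen only for illustration (it searches threshold rules on one-dimensional data); any routine satisfying Definition 5.5 could be passed in its place. Labels are assumed to be +1 or -1, and `gamma` must be strictly less than 1/2.

```python
def weak_learner_stump(xs, ys, w):
    """Hypothetical weak learner: the threshold rule on 1-D points
    with the largest total weight of correctly labeled examples."""
    best_weight, best_h = None, None
    for thr in sorted(set(xs)):
        for sign in (+1, -1):
            # Predict `sign` at or above the threshold, `-sign` below it.
            h = lambda x, thr=thr, sign=sign: sign if x >= thr else -sign
            correct = sum(wi for xi, yi, wi in zip(xs, ys, w) if h(xi) == yi)
            if best_weight is None or correct > best_weight:
                best_weight, best_h = correct, h
    return best_h

def boost(xs, ys, gamma, t0, weak_learner=weak_learner_stump):
    """Run t0 rounds of the boosting procedure and return the majority vote."""
    alpha = (0.5 + gamma) / (0.5 - gamma)
    w = [1.0] * len(xs)                      # every example starts at weight 1
    hyps = []
    for _ in range(t0):
        h = weak_learner(xs, ys, w)
        hyps.append(h)
        # Multiply the weight of each misclassified example by alpha.
        w = [wi * alpha if h(xi) != yi else wi
             for xi, yi, wi in zip(xs, ys, w)]

    def majority(x):
        # t0 is assumed odd, so the vote total is never zero.
        return 1 if sum(h(x) for h in hyps) > 0 else -1
    return majority
```

For example, on the points 1 through 6 with labels +1, +1, -1, -1, +1, +1 no single threshold rule is consistent, but some threshold rule always captures at least two thirds of the weight, so the stump search is a $\gamma$-weak learner for $\gamma = 0.1$ and `boost(xs, ys, gamma=0.1, t0=91)` returns a vote that labels all six points correctly.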


Theorem 5.26 Let A be a $\gamma$-weak learner for sample S of n examples. Then $t_0 = O\!\left(\frac{1}{\gamma^2}\log n\right)$
is sufficient so that the classifier $\text{MAJ}(h_1, \ldots, h_{t_0})$ produced by the boosting
procedure has training error zero.


Proof: Suppose m is the number of examples the final classifier gets wrong. Each of
these m examples was misclassified at least $t_0/2$ times so each has weight at least $\alpha^{t_0/2}$.
Thus the total weight is at least $m\alpha^{t_0/2}$. On the other hand, at time t+1, only the weights
of examples misclassified at time t were increased. By the property of weak learning, the
total weight of misclassified examples is at most $\left(\frac{1}{2} - \gamma\right)$ of the total weight at time t. Let
weight(t) be the total weight at time t. Then

$$\text{weight}(t+1) \le \left(\alpha\left(\tfrac{1}{2} - \gamma\right) + \left(\tfrac{1}{2} + \gamma\right)\right) \times \text{weight}(t) = (1 + 2\gamma) \times \text{weight}(t),$$

where $\alpha = \frac{1/2 + \gamma}{1/2 - \gamma}$ is the constant that the weights of misclassified examples are multiplied
by. Since weight(0) = n, the total weight at the end is at most $n(1 + 2\gamma)^{t_0}$. Thus

$$m\alpha^{t_0/2} \le \text{total weight at end} \le n(1 + 2\gamma)^{t_0}.$$

Substituting $\alpha = \frac{1/2 + \gamma}{1/2 - \gamma} = \frac{1 + 2\gamma}{1 - 2\gamma}$ and solving for m,

$$m \le n(1 - 2\gamma)^{t_0/2}(1 + 2\gamma)^{t_0/2} = n\left[1 - 4\gamma^2\right]^{t_0/2}.$$

Using $1 - x \le e^{-x}$, $m \le n e^{-2 t_0 \gamma^2}$. For $t_0 > \frac{\ln n}{2\gamma^2}$, $m < 1$, so the number of misclassified
items must be zero. ∎
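
To get a feel for the size of $t_0$ in the proof above, the following small sketch evaluates the threshold $t_0 > \ln n/(2\gamma^2)$ and the bound $n(1-4\gamma^2)^{t_0/2}$ on the number of misclassified examples. The function names are assumptions made for this illustration, and the round count is bumped to the next odd integer to match the no-tie assumption of Figure 5.9.

```python
import math

def rounds_for_zero_training_error(n, gamma):
    """Smallest odd integer t0 with t0 > ln(n) / (2 * gamma^2)."""
    t0 = math.floor(math.log(n) / (2 * gamma ** 2)) + 1
    return t0 if t0 % 2 == 1 else t0 + 1

def mistake_bound(n, gamma, t0):
    """Upper bound n * (1 - 4 gamma^2)^(t0 / 2) on the number of mistakes."""
    return n * (1 - 4 * gamma ** 2) ** (t0 / 2)
```

With n = 1000 and $\gamma = 0.1$, the threshold $\ln n/(2\gamma^2)$ is about 345.4, so 347 rounds suffice, and the mistake bound at that point is below one.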
Having completed the proof of the boosting result, here are two interesting observations:


Connection to Hoeffding bounds: The boosting result applies even if our weak learning
algorithm is “adversarial”, giving us the least helpful classifier possible subject
to Definition 5.5. This is why we don't want the $\alpha$ in the boosting algorithm to
be larger than $\frac{1/2 + \gamma}{1/2 - \gamma}$, since in that case the misclassified examples would get higher
total weight than the correctly classified examples, and the weak learner could then
choose to just return the negation of the classifier it gave the last time. Suppose
that the weak learning algorithm gave a classifier each time that for each example,
flipped a coin and produced the correct answer with probability $\frac{1}{2} + \gamma$ and the wrong
answer with probability $\frac{1}{2} - \gamma$, so it is a $\gamma$-weak learner in expectation. In that case,
if we called the weak learner $t_0$ times, for any fixed $x_i$, Hoeffding bounds imply the
chance the majority vote of those classifiers is incorrect on $x_i$ is at most $e^{-2 t_0 \gamma^2}$. So,
the expected total number of mistakes m is at most $n e^{-2 t_0 \gamma^2}$. What is interesting
is that this is the exact bound we get from boosting without the expectation for an
adversarial weak learner.





A minimax view: Consider a 2-player zero-sum game²⁵ with one row for each example
$x_i$ and one column for each hypothesis $h_j$ that the weak-learning algorithm might
output. If the row player chooses row i and the column player chooses column j,
then the column player gets a payoff of one if $h_j(x_i)$ is correct and gets a payoff
of zero if $h_j(x_i)$ is incorrect. The $\gamma$-weak learning assumption implies that for any
randomized strategy for the row player (any “mixed strategy” in the language of
game theory), there exists a response $h_j$ that gives the column player an expected
payoff of at least $\frac{1}{2} + \gamma$. The von Neumann minimax theorem²⁶ states that this
implies there exists a probability distribution on the columns (a mixed strategy for
the column player) such that for any $x_i$, at least a $\frac{1}{2} + \gamma$ probability mass of the
columns under this distribution is correct on $x_i$. We can think of boosting as a
fast way of finding a very simple probability distribution on the columns (just an
average over O(log n) columns, possibly with repetitions) that is nearly as good (for
any $x_i$, more than half are correct) and that moreover works even if our only access to
the columns is by running the weak learner and observing its outputs.
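
The Hoeffding observation above can be checked numerically. The sketch below computes the exact probability that a majority of $t_0$ independent votes is wrong when each vote is wrong with probability $\frac{1}{2} - \gamma$, and compares it with $e^{-2 t_0 \gamma^2}$. The function names are assumptions made for this illustration, and $t_0$ is assumed odd.

```python
import math

def majority_error_probability(t0, gamma):
    """Exact P(more than half of t0 independent votes are wrong),
    where each vote is wrong with probability 1/2 - gamma (t0 odd)."""
    q = 0.5 - gamma
    return sum(math.comb(t0, k) * q ** k * (1 - q) ** (t0 - k)
               for k in range((t0 + 1) // 2, t0 + 1))

def hoeffding_bound(t0, gamma):
    """The bound exp(-2 * t0 * gamma^2) quoted in the text."""
    return math.exp(-2 * t0 * gamma ** 2)
```

Multiplying either quantity by n gives the corresponding bound on the expected number of mistakes over a sample of n examples.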


We argued above that $t_0 = O\!\left(\frac{1}{\gamma^2}\log n\right)$ rounds of boosting are sufficient to produce a
majority-vote rule h that will classify all of S correctly. Using our VC-dimension bounds,
this implies that if the weak learner is choosing its hypotheses from concept class H, then
a sample size

$$n = \tilde{O}\!\left(\frac{1}{\epsilon}\cdot\frac{\text{VCdim}(H)}{\gamma^2}\right)$$

is sufficient to conclude that with probability $1 - \delta$ the error is less than or equal to $\epsilon$,
where we are using the $\tilde{O}$ notation to hide logarithmic factors. It turns out that running
the boosting procedure for larger values of $t_0$, i.e., continuing past the point where S is
classified correctly by the final majority vote, does not actually lead to greater overfitting.
The reason is that using the same type of analysis used to prove Theorem 5.26, one can
show that as $t_0$ increases, not only will the majority vote be correct on each $x \in S$, but in
fact each example will be correctly classified by a $\frac{1}{2} + \gamma'$ fraction of the classifiers, where
$\gamma' \to \gamma$ as $t_0 \to \infty$. That is, the vote is approaching the minimax optimal strategy for
the column player in the minimax view given above. This in turn implies that h can be
well-approximated over S by a vote of a random sample of $O(1/\gamma^2)$ of its component weak
hypotheses $h_j$. Since these small random majority votes are not overfitting by much, our
generalization theorems imply that h cannot be overfitting by much either.


25 A two-person zero-sum game consists of a matrix whose columns correspond to moves for Player 1
and whose rows correspond to moves for Player 2. The ij-th entry of the matrix is the payoff for Player
1 if Player 1 chooses the j-th column and Player 2 chooses the i-th row. Player 2's payoff is the negative of
Player 1's.

26 The von Neumann minimax theorem states that there exists a mixed strategy for each player such that
the best payoff Player 1 can obtain against Player 2's strategy is the negative of the best payoff Player 2
can obtain against Player 1's strategy. A mixed strategy is one in which a probability is assigned to every
possible move for each situation a player could be in.


5.12 Further Current Directions 


We now briefly discuss a few additional current directions in machine learning, focusing 
on semi-supervised learning, active learning, and multi-task learning. 


5.12.1 Semi-Supervised Learning 


Semi-supervised learning uses a large unlabeled data set U to augment a given labeled 
data set L to produce more accurate rules than would have been achieved using just L 
alone. In many settings (e.g., document classification, image classification, speech recog- 
nition), unlabeled data is much more plentiful than labeled data, so one would like to 
make use of it. Unlabeled data is missing the labels but often contains information that 
an algorithm can take advantage of. 


Suppose one believes the target function is a linear separator that separates most of 
the data by a large margin. By observing enough unlabeled data to estimate the proba- 
bility mass near to any given linear separator, one could in principle discard separators 
in advance that slice through dense regions and instead focus attention on those that 
separate most of the distribution by a large margin. This is the high level idea behind 
a technique known as Semi-Supervised SVMs. Alternatively, suppose data objects can 
be described by two different kinds of features (e.g., a webpage could be described using 
words on the page itself or using words on links pointing to the page), and one believes 
that each kind should be sufficient to produce an accurate classifier. Then one might want 
to train a pair of classifiers, one on each type of feature, and use unlabeled data for which 
one classifier is confident but the other is not to bootstrap, labeling such examples with 
the confident classifier and then feeding them as training data to the less-confident one. 
This is the high-level idea behind a technique known as Co-Training. Or, if one believes
“similar examples should generally have the same label”, one might construct a graph
with an edge between examples that are sufficiently similar, and aim for a classifier that 
is correct on the labeled data and has a small cut value on the unlabeled data; this is the 
high-level idea behind graph-based methods. 


A formal model: The batch learning model introduced in Section 5.4 in essence assumes
that one's prior beliefs about the target function can be described in terms of a class of
functions H. In order to capture the reasoning used in semi-supervised learning, we need
to also describe beliefs about the relation between the target function and the data
distribution. One way to do this is via a notion of compatibility $\chi$ between a hypothesis h and
a distribution D. Formally, $\chi$ maps pairs (h, D) to [0, 1] with $\chi(h, D) = 1$ meaning that h
is highly compatible with D and $\chi(h, D) = 0$ meaning that h is highly incompatible with
D. For example, if you believe that nearby points should generally have the same label,
then if h slices through the middle of a high-density region of D (a cluster), you might
call h incompatible with D, whereas if no high-density region is split by h then you might
call it compatible with D. The quantity $1 - \chi(h, D)$ is called the unlabeled error rate of h,
and denoted $\text{err}_{unl}(h)$. Note that for $\chi$ to be useful, it must be possible to estimate it from a finite
sample; to this end, let us further require that $\chi$ is an expectation over individual examples.
That is, overloading notation for convenience, we require $\chi(h, D) = E_{x \sim D}[\chi(h, x)]$,
where $\chi : H \times X \to [0, 1]$.


For instance, suppose we believe the target should separate most data by margin $\gamma$.
We can represent this belief by defining $\chi(h, x) = 0$ if x is within distance $\gamma$ of the
decision boundary of h, and $\chi(h, x) = 1$ otherwise. In this case, $\text{err}_{unl}(h)$ will denote the
probability mass of D within distance $\gamma$ of h's decision boundary. As a different example,
in co-training, we assume each example x can be described using two “views” $x_1$ and
$x_2$ that each are sufficient for classification. That is, we assume there exist $c_1^*$ and $c_2^*$
such that for each example $x = \langle x_1, x_2 \rangle$ we have $c_1^*(x_1) = c_2^*(x_2)$. We can represent this
belief by defining a hypothesis $h = \langle h_1, h_2 \rangle$ to be compatible with an example $\langle x_1, x_2 \rangle$
if $h_1(x_1) = h_2(x_2)$ and incompatible otherwise; $\text{err}_{unl}(h)$ is then the probability mass of
examples on which $h_1$ and $h_2$ disagree.
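
As a concrete illustration of the margin-based compatibility notion, the sketch below evaluates $\chi(h, x)$ for a linear separator and estimates the unlabeled error rate from an unlabeled sample. The representation of h as a weight vector `w` and offset `b`, and the function names, are assumptions made for this example; a point at distance exactly $\gamma$ is treated as within the margin.

```python
import math

def chi_margin(w, b, x, gamma):
    """chi(h, x) for the separator h(x) = sign(w . x + b):
    0 if x lies within distance gamma of the decision boundary, else 1."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    dist = abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
    return 0 if dist <= gamma else 1

def empirical_unlabeled_error(w, b, U, gamma):
    """Estimate of err_unl(h): the fraction of the unlabeled sample U
    that falls within distance gamma of the decision boundary."""
    return 1 - sum(chi_margin(w, b, x, gamma) for x in U) / len(U)
```

A separator that cuts through a dense cluster of `U` receives a high estimated unlabeled error, while one that passes through a sparse region receives a low one, without any labels being consulted.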


As with the class H, one can either assume that the target is fully compatible, i.e.,
$\text{err}_{unl}(c^*) = 0$, or instead aim to do well as a function of how compatible the target is.
The case that we assume $c^* \in H$ and $\text{err}_{unl}(c^*) = 0$ is termed the “doubly realizable
case”. The concept class H and compatibility notion $\chi$ are both viewed as known.


Suppose one is given a concept class H (such as linear separators) and a compatibility
notion $\chi$ (such as penalizing h for points within distance $\gamma$ of the decision boundary).
Suppose also that one believes $c^* \in H$ (or at least is close) and that $\text{err}_{unl}(c^*) = 0$ (or at
least is small). Then, unlabeled data can help by allowing one to estimate the unlabeled
error rate of all $h \in H$, thereby in principle reducing the search space from H (all linear
separators) down to just the subset of H that is highly compatible with D. The key
challenge is how this can be done efficiently (in theory, in practice, or both) for natural
notions of compatibility, as well as identifying types of compatibility that data in
important problems can be expected to satisfy.


The following is a semi-supervised analog of our basic sample complexity theorem,
Theorem 5.4. First, fix some set of functions H and compatibility notion $\chi$. Given a
labeled sample L, define $\widehat{\text{err}}(h)$ to be the fraction of mistakes of h on L. Given an
unlabeled sample U, define $\chi(h, U) = E_{x \sim U}[\chi(h, x)]$ and define $\widehat{\text{err}}_{unl}(h) = 1 - \chi(h, U)$.
That is, $\widehat{\text{err}}(h)$ and $\widehat{\text{err}}_{unl}(h)$ are the empirical error rate and unlabeled error rate of h,
respectively. Finally, given $\alpha > 0$, define $H_{D,\chi}(\alpha)$ to be the set of functions $f \in H$ such
that $\text{err}_{unl}(f) \le \alpha$.


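Theorem 5.27 If $c^* \in H$, then with probability at least $1 - \delta$, for labeled set L and
unlabeled set U drawn from D, the $h \in H$ that optimizes $\widehat{\text{err}}_{unl}(h)$ subject to $\widehat{\text{err}}(h) = 0$
will have $\text{err}_D(h) \le \epsilon$ for

$$|U| \ge \frac{2}{\epsilon^2}\left[\ln|H| + \ln\frac{4}{\delta}\right], \quad\text{and}\quad |L| \ge \frac{1}{\epsilon}\left[\ln\left|H_{D,\chi}(\text{err}_{unl}(c^*) + 2\epsilon)\right| + \ln\frac{2}{\delta}\right].$$

Equivalently, for |U| satisfying this bound, for any |L|, whp the $h \in H$ that minimizes
$\widehat{\text{err}}_{unl}(h)$ subject to $\widehat{\text{err}}(h) = 0$ has

$$\text{err}_D(h) \le \frac{1}{|L|}\left[\ln\left|H_{D,\chi}(\text{err}_{unl}(c^*) + 2\epsilon)\right| + \ln\frac{2}{\delta}\right].$$
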
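Proof: By Hoeffding bounds, |U| is sufficiently large so that with probability at least
$1 - \delta/2$, all $h \in H$ have $|\widehat{\text{err}}_{unl}(h) - \text{err}_{unl}(h)| \le \epsilon$. Thus we have:

$$\{f \in H \mid \widehat{\text{err}}_{unl}(f) \le \text{err}_{unl}(c^*) + \epsilon\} \subseteq H_{D,\chi}(\text{err}_{unl}(c^*) + 2\epsilon).$$

The given bound on |L| is sufficient so that with probability at least $1 - \delta$, all $h \in H$ with
$\widehat{\text{err}}(h) = 0$ and $\widehat{\text{err}}_{unl}(h) \le \text{err}_{unl}(c^*) + \epsilon$ have $\text{err}_D(h) \le \epsilon$; furthermore, $\widehat{\text{err}}_{unl}(c^*) \le
\text{err}_{unl}(c^*) + \epsilon$, so such a function h exists. Therefore, with probability at least $1 - \delta$, the
$h \in H$ that optimizes $\widehat{\text{err}}_{unl}(h)$ subject to $\widehat{\text{err}}(h) = 0$ has $\text{err}_D(h) \le \epsilon$, as desired. ∎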
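
To see how the two bounds in Theorem 5.27 interact, the sketch below evaluates them for a finite class, reading the bounds as $|U| \ge \frac{2}{\epsilon^2}[\ln|H| + \ln\frac{4}{\delta}]$ and $|L| \ge \frac{1}{\epsilon}[\ln|H_{D,\chi}(\cdot)| + \ln\frac{2}{\delta}]$. The function names and the example class sizes are assumptions made purely for illustration.

```python
import math

def unlabeled_sample_size(class_size, eps, delta):
    """|U| >= (2 / eps^2) * (ln|H| + ln(4 / delta)), rounded up."""
    return math.ceil((2 / eps ** 2)
                     * (math.log(class_size) + math.log(4 / delta)))

def labeled_sample_size(compatible_size, eps, delta):
    """|L| >= (1 / eps) * (ln|H_{D,chi}(.)| + ln(2 / delta)), rounded up,
    where compatible_size is the size of the highly compatible subset."""
    return math.ceil((1 / eps)
                     * (math.log(compatible_size) + math.log(2 / delta)))
```

For instance, with |H| = 1000, $\epsilon = 0.1$, and $\delta = 0.05$, the unlabeled bound is 2258 examples; if only 10 hypotheses are highly compatible with D, 60 labeled examples suffice, compared with 106 if all 1000 hypotheses had to be considered.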


One can view Theorem 5.27 as bounding the number of labeled examples needed to learn
well as a function of the “helpfulness” of the distribution D with respect to $\chi$. Namely,
a helpful distribution is one in which $H_{D,\chi}(\alpha)$ is small for $\alpha$ slightly larger than the
unlabeled error rate of the true target function, so we do not need much labeled data to identify a
good function among those in $H_{D,\chi}(\alpha)$. For more information on semi-supervised learning,
see [BB10, BM98, CSZ06, Joa99, Zhu06, ZGL03].


5.12.2 Active Learning 


Active learning refers to algorithms that take an active role in the selection of which
examples are labeled. The algorithm is given an initial unlabeled set U of data points drawn
from distribution D and then interactively requests the labels of a small number of
these examples. The aim is to reach a desired error rate $\epsilon$ using many fewer labels than
would be needed by labeling random examples, i.e., passive learning.


Suppose that data consists of points on the real line and $H = \{f_a \mid f_a(x) = 1 \text{ iff } x \ge a\}$
for $a \in R$. That is, H is the set of all threshold functions on the line. It is not hard to
show (see Exercise 5.12) that a random labeled sample of size $O\!\left(\frac{1}{\epsilon}\log\frac{1}{\delta}\right)$ is sufficient to
ensure that with probability greater than or equal to $1 - \delta$, any consistent threshold $a'$
has error at most $\epsilon$. Moreover, it is not hard to show that $\Omega\!\left(\frac{1}{\epsilon}\right)$ random examples are
necessary for passive learning. Suppose that the data consists of points in the interval
[0, 1] where the points in the interval [0, a) are negative and the points in the interval
[a, 1] are positive. Given a hypothesis set $\{[b, 1] \mid 0 \le b \le 1\}$, a random labeled sample of
size $O\!\left(\frac{1}{\epsilon}\log\frac{1}{\delta}\right)$ is sufficient to ensure that with probability greater than or equal to $1 - \delta$
any hypothesis with zero training error has true error at most $\epsilon$. However, with active
learning we can achieve error $\epsilon$ using only $O\!\left(\log\frac{1}{\epsilon} + \log\log\frac{1}{\delta}\right)$ labels. The idea is as
follows. Assume we are given an unlabeled sample U of size $O\!\left(\frac{1}{\epsilon}\log\frac{1}{\delta}\right)$. Now, query the
leftmost and rightmost points. If both are negative, output b = 1. If both are positive,
output b = 0. Otherwise (the leftmost is negative and the rightmost is positive), use
binary search to find two adjacent examples $x, x'$ of U such that x is negative and $x'$ is
positive, and output $b = (x + x')/2$. This threshold b is consistent with the labels on
the entire set U, and so by the above argument, has error less than or equal to $\epsilon$ with
probability greater than or equal to $1 - \delta$.
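
The binary-search procedure just described can be sketched as follows. The function name `active_learn_threshold` and the label oracle `query_label` (returning 1 for positive and 0 for negative) are assumptions made for this illustration; the sketch assumes the realizable case, so that all negative points lie to the left of all positive points.

```python
def active_learn_threshold(points, query_label):
    """Learn a threshold b on [0, 1] from the unlabeled sample `points`.

    Returns (b, number of label queries made).
    """
    xs = sorted(points)
    queries = 0

    def ask(i):
        nonlocal queries
        queries += 1
        return query_label(xs[i])

    lo, hi = 0, len(xs) - 1
    left, right = ask(lo), ask(hi)
    if left == 0 and right == 0:
        return 1.0, queries          # everything negative: output b = 1
    if left == 1 and right == 1:
        return 0.0, queries          # everything positive: output b = 0
    # Invariant: xs[lo] is negative and xs[hi] is positive.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if ask(mid) == 1:
            hi = mid
        else:
            lo = mid
    # xs[lo] and xs[hi] are adjacent with opposite labels.
    return (xs[lo] + xs[hi]) / 2, queries
```

On an unlabeled sample of size |U| the loop makes at most about $\log_2 |U|$ queries beyond the first two, which is where the logarithmic label complexity comes from.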


The agnostic case, where the target need not belong to the given class H, is quite a bit
more subtle, and is addressed in a quite general way in the “A²” Agnostic Active learning
algorithm [BBL09]. For more information on active learning, see [Das11, BU14].


5.12.3 Multi-Task Learning 


In this chapter we have focused on scenarios where our goal is to learn a single target
function $c^*$. However, there are also scenarios where one would like to learn multiple target
functions $c_1^*, c_2^*, \ldots, c_n^*$. If these functions are related in some way, then one could hope to
do so with less data per function than one would need to learn each function separately.
This is the idea of multi-task learning.


One natural example is object recognition. Given an image x, $c_1^*(x)$ might be 1 if x is
a coffee cup and 0 otherwise; $c_2^*(x)$ might be 1 if x is a pencil and 0 otherwise; $c_3^*(x)$ might
be 1 if x is a laptop and 0 otherwise. These recognition tasks are related in that image
features that are good for one task are likely to be helpful for the others as well. Thus,
one approach to multi-task learning is to try to learn a common representation under
which each of the target functions can be described as a simple function. Another natural
example is personalization. Consider a speech recognition system with n different users.
In this case there are n target tasks (recognizing the speech of each user) that are clearly
related to each other. Some good references for multi-task learning are [TM95, Thr96].
5.14 Exercises


Exercise 5.1 Each of the following data sets consists of a subset of the d-dimensional 
0/1 vectors labeled +1. The remaining 0/1 vectors are labeled -1. Which sets are linearly 
separable? 


1. {010, 011, 100, 111} 
2. {011, 100, 110, 111} 


3. {0100, 0101, 0110, 1000, 1100, 1101, 1110, 1111} 


Exercise 5.2 Run the Perceptron Algorithm on each of the examples in Exercise 5.1. 
What happens? 


Exercise 5.3 (representation and linear separators) A logical disjunction is the or
of a set of Boolean variables such as $x_1 \vee x_2 \vee x_4$. Show that any disjunction over $\{0, 1\}^d$
can be represented as a linear separator. Show that moreover the margin of separation is
$\Omega(1/\sqrt{d})$.


Exercise 5.4 (representation and linear separators) Show that the parity function
on $d \ge 2$ Boolean variables cannot be represented by a linear threshold function. The
parity function is 1 if and only if an odd number of inputs is 1.


Exercise 5.5 Given two sets of d-dimensional vectors $S_1$ and $S_2$, how would you determine
if the convex hulls of the two sets intersect?


Solution: Use the Perceptron Algorithm.

Exercise 5.6 (kernels) Prove Theorem 5.3.

Exercise 5.7 Find the mapping $\varphi(x)$ that gives rise to the kernel

$$K(x, y) = (x_1 y_1 + x_2 y_2)^2.$$


Exercise 5.8 Give an example of overfitting, that is, where training error is much less 
than true error. 


Exercise 5.9 One popular practical method for machine learning is to learn a decision
tree (Figure 5.10). While finding the smallest decision tree that fits a given training sample
S is NP-hard, there are a number of heuristics that are used in practice. Suppose we run
such a heuristic on a training set S and it outputs a tree with k nodes. Show that such a
tree can be described using O(k log d) bits, where d is the number of features. Assume all
features are binary valued (as in Figure 5.10). By Theorem 5.7, we can be confident the
true error is low if we can produce a consistent tree with fewer than $\epsilon|S|/\log(d)$ nodes.
Figure 5.10: A decision tree with three internal nodes and four leaves. This tree corresponds
to the Boolean function 1,73 V 1,1213 V 1213.


Exercise 5.10 Consider the class of OR-functions (disjunctions) over d binary (0/1-valued)
features. For instance, one such OR-function is $x_1 \vee x_2 \vee x_4$. Using Theorem 5.4,
how many examples are sufficient so that with probability at least $1 - \delta$, only OR-functions
of true error less than $\epsilon$ will have zero training error?


Exercise 5.11 Consider the instance space $X = \{0, 1\}^d$ and let H be the class of 3-CNF
formulas. That is, H is the set of concepts that can be described as a conjunction of clauses
where each clause is an OR of up to 3 literals. (These are also called 3-SAT formulas.)
For example, $c^*$ might be $(x_1 \vee \bar{x}_2 \vee x_3)(x_2 \vee x_4)(\bar{x}_1 \vee x_3)(x_2 \vee x_3 \vee x_4)$. Assume we are
in the PAC learning setting, so examples are drawn from some underlying distribution D
and labeled by some 3-CNF formula $c^*$.


1. Give a number of samples m that would be sufficient to ensure that with probability
greater than or equal to $1 - \delta$, all 3-CNF formulas consistent with the sample have
error at most $\epsilon$ with respect to D.


2. Give a polynomial-time algorithm that finds a 3-CNF formula consistent with the
sample if one exists.


Exercise 5.12 Consider an instance space X consisting of integers 1 to 1,000,000 and a
target concept $c^*$ where $c^*(i) = 1$ for $500{,}001 \le i \le 1{,}000{,}000$. If your hypothesis class H is
$\{h_j \mid h_j(i) = 1 \text{ for } i \ge j \text{ and } h_j(i) = 0 \text{ for } i < j\}$, how large a training set S do you need to
ensure that with probability 99% any consistent hypothesis in H will have true error less
than 10%?


Exercise 5.13 Consider a deep network with 100,000 parameters each given by a 32-bit
floating point number. Suppose the network is trained on 100,000,000 training examples.
Corollary 5.8 says that with probability 99.9% the true error will differ from the empirical
error by some $\epsilon$. What is the value of $\epsilon$?


Exercise 5.14 (Regularization) Pruning a decision tree: Let S be a labeled sample
drawn iid from some distribution D over $\{0, 1\}^n$, and suppose S was used to create
some decision tree T. However, the tree T is large, and we might be overfitting. Give
a polynomial-time algorithm for pruning T that finds the pruning h of T that optimizes
the right-hand side of Corollary 5.8, i.e., that for a given $\delta > 0$ minimizes:

$$\text{err}_S(h) + \sqrt{\frac{\text{size}(h)\ln(4) + \ln(2/\delta)}{2|S|}}.$$

To discuss this, we define the meaning of “pruning” of T and the meaning of “size” of
h. A pruning h of T is a tree in which some internal nodes of T have been turned into
leaves, labeled “+” or “−” depending on whether the majority of examples in S that reach
that node are positive or negative. Let size(h) = L(h) log(n) where L(h) is the number of
leaves in h.

Hint #1: it is sufficient, for each integer L = 1, 2, ..., L(T), to find the pruning of T
with L leaves of lowest empirical error on S, that is, $h_L = \text{argmin}_{h : L(h) = L}\,\text{err}_S(h)$. Then
plug them all into the displayed formula above and pick the best one.

Hint #2: use dynamic programming.


Exercise 5.15 Consider the instance space $X = R^2$. What is the VC-dimension of right
corners with axis-aligned edges that are oriented with one edge going to the right and the
other edge going up?


Exercise 5.16 (VC-dimension; Section 5.5) What is the VC-dimension V of the
class H of axis-parallel boxes in $R^d$? That is, $H = \{h_{a,b} \mid a, b \in R^d\}$ where $h_{a,b}(x) = 1$ if
$a_i \le x_i \le b_i$ for all $i = 1, \ldots, d$ and $h_{a,b}(x) = -1$ otherwise. Select a set of points V that
is shattered by the class and


1. prove that the VC-dimension is at least |V| by proving V shattered, and 


2. prove that the VC-dimension is at most |V| by proving that no set of |V| +1 points 
can be shattered. 


Exercise 5.17 (VC-dimension) Prove that the VC-dimension of circles in the plane is
three.


Exercise 5.18 (VC-dimension, Perceptron, and Margins) A set of points S is
“shattered by linear separators of margin $\gamma$” if every labeling of the points in S is achievable
by a linear separator of margin at least $\gamma$. Prove that no set of $1/\gamma^2 + 1$ points in the unit
ball is shattered by linear separators of margin $\gamma$.

Hint: think about the Perceptron Algorithm and try a proof by contradiction. 


Exercise 5.19 Given two fully connected levels without a non-linear element, one can 


combine the two levels into one level. Can this be done for two convolution levels without 
pooling and without a nonlinear element? 
Exercise 5.20 At present there are many interesting research directions in deep learning
that are being explored. This exercise focuses on whether gates in networks learn the same
thing independent of the architecture or how the network is trained. On the web there
are several copies of Alexnet that have been trained starting from different random initial
weights. Select two copies and form a matrix where the columns of the matrix correspond
to gates in the first copy of Alexnet and the rows of the matrix correspond to gates of
the same level in the second copy. The ij-th entry of the matrix is the covariance of the
activation of the j-th gate in the first copy of Alexnet with the i-th gate in the second copy.
The covariance is the expected value over all images in the data set.


1. Match the gates in the two copies of the network using a bipartite graph matching 
algorithm. What is the fraction of matches that have a high covariance? 


2. It is possible that there is no good one-to-one matching of gates but that some small
set of gates in the first copy of the network learn what some small set of gates in the
second copy learn. Explore a clustering technique to match sets of gates and carry
out an experiment to do this.


Exercise 5.21 


1. Input an image to a deep learning network. Reproduce the image from the activation
vector, $a_{image}$, it produced by inputting a random image and producing an activation
vector $a_{random}$. Then by gradient descent modify the pixels in the random image to
minimize the error function $|a_{image} - a_{random}|^2$.


2. Train a deep learning network to produce an image from an activation network. 


Exercise 5.22 


1. Create and train a simple deep learning network consisting of a convolution level with 
pooling, a fully connected level, and then softmax. Keep the network small. For input 
data use the MNIST data set http://yann.lecun.com/exdb/mnist/ with 28 x 28 
images of digits. Use maybe 20 channels for the convolution level and 100 gates for 
the fully connected level. 


2. Create and train a second network with two fully connected levels, the first level with 
200 gates and the second level with 100 gates. How does the accuracy of the second 
network compare to the first? 


3. Train the second network again but this time use the activation vector of the 100 
gate level and train the second network to produce that activation vector and only 
then train the softmax. How does the accuracy compare to direct training of the 
second network and the first network? 
Exercise 5.23 (Perceptron and Stochastic Gradient Descent) We know the Perceptron
Algorithm makes at most $1/\gamma^2$ updates on any sequence of examples that is separable
by margin $\gamma$ (assume all examples have length at most 1). However, it need not
find a separator of large margin. If we also want to find a separator of large margin, a
natural alternative is to update on any example $x_i$ such that $l_i(w \cdot x_i) < 1$; this is called
the margin perceptron algorithm.


1. Argue why margin perceptron is equivalent to running stochastic gradient descent 
with learning rate 1 (on each example, add the negative gradient of the loss function 
to the current weight vector) on the class of linear predictors with hinge-loss as the 
loss function. 


2. Prove that on any sequence of examples that are separable by margin $\gamma$, this algorithm
will make at most $3/\gamma^2$ updates.


3. In Part 2 you probably proved that each update increases $|w|^2$ by at most 3. Use this
and your result from Part 2 to conclude that if you have a dataset S that is separable
by margin $\gamma$, and cycle through the data until the margin perceptron algorithm makes
no more updates, then it will find a separator of margin at least $\gamma/3$.


Exercise 5.24 Consider running the Perceptron Algorithm in the online model on some
sequence of examples S. Let S' be the same set of examples as S but presented in a different
order. Does the Perceptron Algorithm necessarily make the same number of mistakes on
S as it does on S'? If so, why? If not, show such an S and S' consisting of the same set
of examples in a different order where the Perceptron Algorithm makes a different number
of mistakes on S' than it does on S.


Exercise 5.25 (Sleeping Experts and Decision trees) “Pruning” a Decision Tree 
Online via Sleeping Experts: Suppose that, as in Exercise 5.14, we are given a decision tree 
T, but now we are faced with a sequence of examples that arrive online. One interesting 
way we can make predictions is as follows. For each node v of T (internal node or leaf) 
create two sleeping experts: one that predicts positive on any example that reaches v and 
one that predicts negative on any example that reaches v. So, the total number of sleeping 
experts is equal to the number of nodes of T which is proportional to the number of leaves 
L(T) of T. 


1. Say why any pruning $h$ of $T$, and any assignment of $\{+,-\}$ labels to the leaves of $h$,
corresponds to a subset of sleeping experts with the property that exactly one sleeping
expert in the subset makes a prediction on any given example.


2. Prove that for any sequence $S$ of examples, and any given number of leaves $L$, if
we run the sleeping-experts algorithm using $\epsilon = \sqrt{\frac{L \log(L(T))}{|S|}}$, then the expected error
rate of the algorithm on $S$ (the total number of mistakes of the algorithm divided by
$|S|$) will be at most $err_S(h_L) + O\left(\sqrt{\frac{L \log(L(T))}{|S|}}\right)$, where $h_L = \arg\min_{h : L(h) = L} err_S(h)$
is the pruning of $T$ with $L$ leaves of lowest error on $S$.


3. In the above question, we assumed $L$ was given. Explain how we can remove this
assumption and achieve a bound of $\min_L \left[ err_S(h_L) + O\left(\sqrt{\frac{L \log(L(T))}{|S|}}\right) \right]$ by instantiating
$L(T)$ copies of the above algorithm (one for each value of $L$) and then combining
these algorithms using the experts algorithm (in this case, none of them will be
sleeping).


Exercise 5.26 (Boosting) Consider the boosting algorithm given in Figure 5.9. Suppose
hypothesis $h_t$ has error rate $\beta_t$ on the weighted sample $(S, w)$ for $\beta_t$ much less than $\frac{1}{2} - \gamma$.
Then, after the booster multiplies the weight of misclassified examples by $\alpha$, hypothesis $h_t$
will still have error less than $\frac{1}{2} - \gamma$ under the new weights. This means that $h_t$ could be
given again to the booster (perhaps for several times in a row). Calculate, as a function of
$\alpha$ and $\beta_t$, approximately how many times in a row $h_t$ could be given to the booster before
its error rate rises to above $\frac{1}{2} - \gamma$. You may assume $\beta_t$ is much less than $\frac{1}{2} - \gamma$.

Note: The AdaBoost boosting algorithm [FS97] can be viewed as performing this
experiment internally, multiplying the weight of misclassified examples by $\alpha_t = \frac{1 - \beta_t}{\beta_t}$ and
then giving $h_t$ a weight proportional to the quantity you computed in its final majority-vote
function.


6 Algorithms for Massive Data Problems: Streaming, Sketching, and Sampling


6.1 Introduction 


This chapter deals with massive data problems where the input data is too large to be
stored in random access memory. One model for such problems is the streaming model,
where $n$ data items $a_1, a_2, \ldots, a_n$ arrive one at a time. For example, the $a_i$ might be
IP addresses being observed by a router on the internet. The goal is to compute some
statistics, property, or summary of these data items without using too much memory,
much less than $n$. More specifically, we assume each $a_i$ itself is a $b$-bit quantity where $b$ is
not too large. For example, each $a_i$ might be an integer, $1 \le a_i \le m$, where $m = 2^b$. The
goal is to produce some desired output using space polynomial in $b$ and $\log n$; see Figure
6.1.


For example, a very easy problem to solve in the streaming model is to compute the
sum of the $a_i$. If each $a_i$ is an integer between 1 and $m = 2^b$, then the sum of all the $a_i$ is
an integer between 1 and $mn$ and so the number of bits of memory needed to maintain
the sum is $O(b + \log n)$. A harder problem, discussed shortly, is computing the number
of distinct numbers in the input sequence.


One natural approach for tackling a range of problems in the streaming model is to
perform random sampling of the input “on the fly”. To introduce the basic flavor of
sampling on the fly, consider a stream $a_1, a_2, \ldots, a_n$ from which we are to select an index
$i$ with probability proportional to the value of $a_i$. When we see an element, we do not
know the probability with which to select it since the normalizing constant depends on
all of the elements including those we have not yet seen. However, the following method
works. Let $s$ be the sum of the $a_i$'s seen so far. Maintain $s$ and an index $i$ selected with
probability $\frac{a_i}{s}$. Initially $i = 1$ and $s = a_1$. Having seen symbols $a_1, a_2, \ldots, a_j$, $s$ will equal
$a_1 + a_2 + \cdots + a_j$ and for $i$ in $\{1, \ldots, j\}$, the selected index will be $i$ with probability $\frac{a_i}{s}$.
On seeing $a_{j+1}$, change the selected index to $j + 1$ with probability $\frac{a_{j+1}}{s + a_{j+1}}$ and otherwise
keep the same index as before with probability $1 - \frac{a_{j+1}}{s + a_{j+1}}$. If we change the index to $j + 1$,
clearly it was selected with the correct probability. If we keep $i$ as our selection, then it
will have been selected with probability

$$\left(1 - \frac{a_{j+1}}{s + a_{j+1}}\right) \frac{a_i}{s} = \frac{s}{s + a_{j+1}} \cdot \frac{a_i}{s} = \frac{a_i}{s + a_{j+1}},$$

which is the correct probability for selecting index $i$. Finally $s$ is updated by adding $a_{j+1}$
to $s$. This problem comes up in many areas such as sleeping experts where there is a
sequence of weights and we want to pick an expert with probability proportional to its
weight. The $a_i$'s are the weights and the subscript $i$ denotes the expert.
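
The sampling rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_index_proportional` is ours, the weights are assumed to be strictly positive, and indices are 1-based to match the text.

```python
import random

def sample_index_proportional(stream, rng):
    """Return a 1-based index i chosen with probability a_i / (a_1 + ... + a_n).

    Assumes every weight in the stream is strictly positive.
    Only the running sum and the current choice are kept in memory.
    """
    total = 0.0      # s: sum of the weights seen so far
    chosen = None    # currently selected index
    for j, a in enumerate(stream, start=1):
        total += a
        # Switch to the new index with probability a_j / (s + a_j);
        # for j = 1 this probability is 1, so the first item is always taken.
        if rng.random() < a / total:
            chosen = j
    return chosen
```

With weights 1, 2, 7, for instance, index 3 should be returned in roughly 70% of independent runs.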





[Figure: stream $a_1, a_2, \ldots, a_n$ → Algorithm (low space) → some output]

Figure 6.1: High-level representation of the streaming model


6.2 Frequency Moments of Data Streams 


An important class of problems concerns the frequency moments of data streams. As
mentioned above, a data stream $a_1, a_2, \ldots, a_n$ of length $n$ consists of symbols $a_i$ from
an alphabet of $m$ possible symbols, which for convenience we denote as $\{1,2,\ldots,m\}$.
Throughout this section, $n$, $m$, and $a_i$ will have these meanings and $s$ (for symbol) will
denote a generic element of $\{1,2,\ldots,m\}$. The frequency $f_s$ of the symbol $s$ is the number
of occurrences of $s$ in the stream. For a non-negative integer $p$, the $p^{th}$ frequency moment
of the stream is

$$\sum_{s=1}^{m} f_s^p.$$

Note that the $p = 0$ frequency moment corresponds to the number of distinct symbols
occurring in the stream using the convention $0^0 = 0$. The first frequency moment is just
$n$, the length of the string. The second frequency moment, $\sum_s f_s^2$, is useful in computing
the variance of the stream, i.e., the average squared difference from the average frequency.

$$\frac{1}{m}\sum_{s=1}^{m}\left(f_s - \frac{n}{m}\right)^2 = \frac{1}{m}\sum_{s=1}^{m}\left(f_s^2 - 2\frac{n}{m}f_s + \left(\frac{n}{m}\right)^2\right) = \left(\frac{1}{m}\sum_{s=1}^{m} f_s^2\right) - \frac{n^2}{m^2}$$

In the limit as $p$ becomes large, $\left(\sum_{s=1}^{m} f_s^p\right)^{1/p}$ is the frequency of the most frequent
element(s).
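
To make the definitions concrete, here is a small Python sketch that computes frequency moments and the variance directly from a stream. It stores a counter per distinct symbol, so it is not a low-space streaming algorithm; it only illustrates the quantities. The function names are ours.

```python
from collections import Counter

def frequency_moment(stream, p):
    """p-th frequency moment: sum over symbols of f_s ** p (with 0**0 = 0)."""
    counts = Counter(stream)
    if p == 0:
        return len(counts)              # number of distinct symbols
    return sum(f ** p for f in counts.values())

def frequency_variance(stream, m):
    """Average squared difference of f_s from the average frequency n/m,
    computed as (1/m) * F_2 - (n/m)**2 for an alphabet {1, ..., m}."""
    n = len(stream)
    return frequency_moment(stream, 2) / m - (n / m) ** 2
```

For the stream 1, 1, 2, 3, 3, 3 over the alphabet $\{1,2,3,4\}$ the frequencies are 2, 1, 3, 0, so the zeroth, first, and second moments are 3, 6, and 14, and the variance is $14/4 - (6/4)^2 = 1.25$.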


We will describe sampling based algorithms to compute these quantities for streaming 
data shortly. First a note on the motivation for these problems. The identity and frequency
of the most frequent item, or more generally, items whose frequency exceeds a
given fraction of n, is clearly important in many applications. If the items are packets on 
a network with source and/or destination addresses, the high frequency items identify the 
heavy bandwidth users. If the data consists of purchase records in a supermarket, the high 
frequency items are the best-selling items. Determining the number of distinct symbols 
is the abstract version of determining such things as the number of accounts, web users, 
or credit card holders. The second moment and variance are useful in networking as well 
as in database and other applications. Large amounts of network log data are generated 
by routers that can record the source address, destination address, and the number of 
packets for all the messages passing through them. This massive data cannot be easily 
sorted or aggregated into totals for each source/destination. But it is important to know 
if some popular source-destination pairs have a lot of traffic, for which the variance is one
natural measure.


6.2.1 Number of Distinct Elements in a Data Stream 


Consider a sequence $a_1, a_2, \ldots, a_n$ of $n$ elements, each $a_i$ an integer in the range 1 to $m$
where $n$ and $m$ are very large. Suppose we wish to determine the number of distinct $a_i$ in
the sequence. Each $a_i$ might represent a credit card number extracted from a sequence of
credit card transactions and we wish to determine how many distinct credit card accounts
there are. Note that this is easy to do in $O(m)$ space by just storing a bit-vector that
records which elements have been seen so far and which have not. It is also easy to do in
$O(n \log m)$ space by storing a list of all distinct elements that have been seen. However,
our goal is to use space logarithmic in $m$ and $n$. We first show that this is impossible
using an exact deterministic algorithm. Any deterministic algorithm that determines the
number of distinct elements exactly must use at least $m$ bits of memory on some input
sequence of length $O(m)$. We then will show how to get around this problem using
randomization and approximation.


Lower bound on memory for exact deterministic algorithm We show that any
exact deterministic algorithm must use at least $m$ bits of memory on some sequence of
length $m + 1$. Suppose we have seen $a_1, \ldots, a_m$, and our algorithm uses less than $m$ bits
of memory on all such sequences. There are $2^m - 1$ possible nonempty subsets of $\{1,2,\ldots,m\}$ that
the sequence could contain and yet only $2^{m-1}$ possible states of our algorithm’s memory.
Therefore there must be two different subsets $S_1$ and $S_2$ that lead to the same memory
state. If $S_1$ and $S_2$ are of different sizes, then clearly this implies an error for one of the
input sequences. On the other hand, if they are the same size and the next element is in
$S_1$ but not $S_2$, the algorithm will give the same answer in both cases and therefore must
give an incorrect answer on at least one of them.


Algorithm for the number of distinct elements To beat the above lower bound,
consider approximating the number of distinct elements. Our algorithm will produce a
number that is within a constant factor of the correct answer using randomization and
thus a small probability of failure. Suppose the set $S$ of distinct elements was chosen
uniformly at random from $\{1,\ldots,m\}$. Let min denote the minimum element in $S$. What
is the expected value of min? If there was one distinct element, then its expected value
would be roughly $\frac{m}{2}$. If there were two distinct elements, the expected value of the
minimum would be roughly $\frac{m}{3}$. More generally, for a random set $S$, the expected value of
the minimum is approximately $\frac{m}{|S|+1}$. See Figure 6.2. Solving $\min = \frac{m}{|S|+1}$ yields $|S| = \frac{m}{\min} - 1$. This
suggests keeping track of the minimum element in $O(\log m)$ space and using this equation
to give an estimate of $|S|$.


Figure 6.2: Estimating the size of $S$ from the minimum element in $S$ which has value
approximately $\frac{m}{|S|+1}$. The elements of $S$ partition the set $\{1,2,\ldots,m\}$ into $|S|+1$ subsets
each of size approximately $\frac{m}{|S|+1}$.


Converting the intuition into an algorithm via hashing In general, the set $S$
might not have been chosen uniformly at random. If the elements of $S$ were obtained
by selecting the $|S|$ smallest elements of $\{1,2,\ldots,m\}$, the above technique would give a
very bad answer. However, we can convert our intuition into an algorithm that works
well with high probability on every sequence via hashing. Specifically, we will use a hash
function $h$ where

$$h : \{1,2,\ldots,m\} \to \{0,1,2,\ldots,M-1\},$$

and then instead of keeping track of the minimum element $a_i \in S$, we will keep track of
the minimum hash value. The question now is: what properties of a hash function do
we need? Since we need to store $h$, we cannot use a totally random mapping since that
would take too many bits. Luckily, a pairwise independent hash function, which can be
stored compactly, is sufficient.


We recall the formal definition of pairwise independence below. But first recall that 
a hash function is always chosen at random from a family of hash functions and phrases 
like “probability of collision” refer to the probability in the choice of hash function. 


2-Universal (Pairwise Independent) Hash Functions Various applications use
different amounts of randomness. Full randomness for a vector $\mathbf{x} = (x_1, x_2, \ldots, x_d)$ where
each $x_i \in \{0,1\}$ would require selecting $\mathbf{x}$ uniformly at random from the set of all $2^d$ 0-1
vectors. However, if we only need that each $x_i$ be equally likely to be 0 or 1, we can select
$\mathbf{x}$ from the set of two vectors $\{(0,0,\ldots,0), (1,1,\ldots,1)\}$. If in addition, we want each pair
of coordinates $x_i$ and $x_j$ to be statistically independent, we need a larger set to choose from
which has the property that for each pair $i$ and $j$, $(x_i, x_j)$ is equally likely to be (0,0),
(0,1), (1,0) or (1,1).


A set of hash functions

$$H = \{h \mid h : \{1,2,\ldots,m\} \to \{0,1,2,\ldots,M-1\}\}$$

is 2-universal or pairwise independent if for all $x$ and $y$ in $\{1,2,\ldots,m\}$ with $x \ne y$,
$h(x)$ and $h(y)$ are each equally likely to be any element of $\{0,1,2,\ldots,M-1\}$ and are
statistically independent. It follows that a set of hash functions $H$ is 2-universal if and
only if for all $x$ and $y$ in $\{1,2,\ldots,m\}$, $x \ne y$, $h(x)$ and $h(y)$ are each equally likely to be
any element of $\{0,1,2,\ldots,M-1\}$, and for all $w$ and $z$

$$\text{Prob}(h(x) = w \text{ and } h(y) = z) = \frac{1}{M^2}.$$

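We now give an example of a 2-universal family of hash functions. Let $M$ be a prime
greater than $m$. For each pair of integers $a$ and $b$ in the range $[0, M-1]$, define a hash
function

$$h_{ab}(x) = ax + b \pmod{M}.$$

To store the hash function $h_{ab}$, store the two integers $a$ and $b$. This requires only $O(\log M)$
space. To see that the family is 2-universal note that $h(x) = w$ and $h(y) = z$ if and only if

$$\begin{pmatrix} x & 1 \\ y & 1 \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} w \\ z \end{pmatrix} \pmod{M}.$$

If $x \ne y$, the matrix $\begin{pmatrix} x & 1 \\ y & 1 \end{pmatrix}$ is invertible modulo $M$.$^{27}$ Thus

$$\begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} x & 1 \\ y & 1 \end{pmatrix}^{-1} \begin{pmatrix} w \\ z \end{pmatrix} \pmod{M}$$

and for each $\begin{pmatrix} w \\ z \end{pmatrix}$ there is a unique $\begin{pmatrix} a \\ b \end{pmatrix}$. Hence

$$\text{Prob}(h(x) = w \text{ and } h(y) = z) = \frac{1}{M^2}$$

and $H$ is 2-universal.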
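
For a small prime $M$, the 2-universal property can be checked exhaustively. The Python sketch below enumerates every pair $(a, b)$ and confirms that, for each $x \ne y$, every target pair $(w, z)$ is hit by exactly one hash function. The function names are ours, and the brute-force enumeration is only practical for tiny $M$.

```python
from itertools import product

def joint_counts(M, x, y):
    """For fixed x and y, count how many (a, b) give each pair (h(x), h(y)),
    where h(x) = (a*x + b) mod M and a, b range over 0..M-1."""
    counts = {}
    for a, b in product(range(M), repeat=2):
        key = ((a * x + b) % M, (a * y + b) % M)
        counts[key] = counts.get(key, 0) + 1
    return counts

def is_pairwise_independent(M, m):
    """True if every pair (w, z) arises from exactly one (a, b),
    for all x != y in {1, ..., m}."""
    for x in range(1, m + 1):
        for y in range(x + 1, m + 1):
            counts = joint_counts(M, x, y)
            if len(counts) != M * M or set(counts.values()) != {1}:
                return False
    return True
```

With $M = 7$ and $m = 6$ the check succeeds. With the composite modulus $M = 6$ it fails, since for example $x = 1$ and $y = 3$ can only produce pairs with $z - w$ even, which shows why the primality of $M$ matters.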

Analysis of distinct element counting algorithm Let $b_1, b_2, \ldots, b_d$ be the distinct
values that appear in the input. Select $h$ from the 2-universal family of hash functions $H$.
Then the set $S = \{h(b_1), h(b_2), \ldots, h(b_d)\}$ is a set of $d$ random and pairwise independent
values from the set $\{0,1,2,\ldots,M-1\}$. We now show that $\frac{M}{\min}$ is a good estimate for $d$,
the number of distinct elements in the input, where $\min = \min(S)$.


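Lemma 6.1 With probability at least $\frac{2}{3} - \frac{d}{M}$, the estimate of the number of distinct
elements satisfies $\frac{d}{6} \le \frac{M}{\min} \le 6d$, where min is the smallest element of $S$. That is,
$\frac{M}{6d} \le \min \le \frac{6M}{d}$.


Proof: First, we show that $\text{Prob}\left(\min < \frac{M}{6d}\right) \le \frac{1}{6} + \frac{d}{M}$. This part does not require pairwise
independence.

$$\text{Prob}\left(\min < \frac{M}{6d}\right) = \text{Prob}\left(\exists k,\ h(b_k) < \frac{M}{6d}\right) \le \sum_{i=1}^{d} \text{Prob}\left(h(b_i) < \frac{M}{6d}\right) \le d\left(\frac{\lceil M/(6d) \rceil}{M}\right) \le d\left(\frac{1}{6d} + \frac{1}{M}\right) \le \frac{1}{6} + \frac{d}{M}.$$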


27 The primality of $M$ ensures that inverses of elements exist in $Z_M^*$, and $M > m$ ensures that if $x \ne y$,
then $x$ and $y$ are not equal mod $M$.

Figure 6.3: Location of the minimum in the distinct counting algorithm.


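Next, we show that $\text{Prob}\left(\min > \frac{6M}{d}\right) \le \frac{1}{6}$. This part will use pairwise independence.

First,

$$\text{Prob}\left(\min > \frac{6M}{d}\right) = \text{Prob}\left(\forall k,\ h(b_k) > \frac{6M}{d}\right).$$

For $i = 1,2,\ldots,d$, define the indicator variable

$$y_i = \begin{cases} 0 & \text{if } h(b_i) > \frac{6M}{d} \\ 1 & \text{otherwise} \end{cases}$$

and let

$$y = \sum_{i=1}^{d} y_i.$$

We want to show that $\text{Prob}(y = 0)$ is small. Now $\text{Prob}(y_i = 1) \ge \frac{6}{d}$, $E(y_i) \ge \frac{6}{d}$, and
$E(y) \ge 6$. For 2-way independent random variables, the variance of their sum is the
sum of their variances. So $\text{Var}(y) = d\,\text{Var}(y_1)$. Further, since $y_1$ is 0 or 1, $\text{Var}(y_1) =
E\left[(y_1 - E(y_1))^2\right] = E(y_1^2) - E^2(y_1) = E(y_1) - E^2(y_1) \le E(y_1)$. Thus $\text{Var}(y) \le E(y)$.
By the Chebyshev inequality,

$$\begin{aligned}
\text{Prob}\left(\min > \frac{6M}{d}\right) &= \text{Prob}\left(\forall k,\ h(b_k) > \frac{6M}{d}\right) \\
&= \text{Prob}(y = 0) \\
&\le \text{Prob}\left(|y - E(y)| \ge E(y)\right) \\
&\le \frac{\text{Var}(y)}{E^2(y)} \le \frac{1}{E(y)} \le \frac{1}{6}.
\end{aligned}$$

Since $\frac{M}{\min} > 6d$ with probability at most $\frac{1}{6} + \frac{d}{M}$ and $\frac{M}{\min} < \frac{d}{6}$ with probability at most $\frac{1}{6}$,
$\frac{d}{6} \le \frac{M}{\min} \le 6d$ with probability at least $\frac{2}{3} - \frac{d}{M}$. ∎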
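
The whole procedure (pick a random $h_{ab}$, track the minimum hash value, output $M/\min$) can be sketched as follows. This is an illustrative Python version, not a tuned implementation: the function name is ours, the caller is assumed to supply a prime $M$ larger than every stream element, and the guard for a minimum hash value of 0 is our own choice, since $M/\min$ is undefined in that case.

```python
import random

def estimate_distinct(stream, M, rng):
    """Estimate the number of distinct elements as M / (minimum hash value).

    M is assumed to be a prime larger than every element of the stream.
    Only a, b and the running minimum are stored.
    """
    a = rng.randrange(M)   # random member h(x) = (a*x + b) mod M of the family
    b = rng.randrange(M)
    smallest = None
    for item in stream:
        h = (a * item + b) % M
        if smallest is None or h < smallest:
            smallest = h
    if smallest is None:
        return 0.0                      # empty stream
    return M / max(smallest, 1)         # guard (our assumption) against min = 0
```

By Lemma 6.1, a single run lands within a factor of 6 of the true count with probability at least $\frac{2}{3} - \frac{d}{M}$, so over many independent runs well over half of the estimates should fall in $[d/6, 6d]$.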








6.2.2 Number of Occurrences of a Given Element. 


To count the number of occurrences of a given element in a stream requires at most
$\log n$ space where $n$ is the length of the stream. Clearly, for any length stream that occurs
in practice, one can afford $\log n$ space. For this reason, the following material may never
be used in practice, but the technique is interesting and may give insight into how to solve
some other problem.


Consider a string of 0's and 1's of length $n$ in which we wish to count the number of
occurrences of 1's. Clearly with $\log n$ bits of memory we could keep track of the exact
number of 1's. However, the number can be approximated with only $\log \log n$ bits.


Let $m$ be the number of 1's that occur in the sequence. Keep a value $k$ such that $2^k$
is approximately the number $m$ of occurrences. Storing $k$ requires only $\log \log n$ bits of
memory. The algorithm works as follows. Start with $k = 0$. For each occurrence of a 1,
add one to $k$ with probability $1/2^k$. At the end of the string, the quantity $2^k - 1$ is the
estimate of $m$. To obtain a coin that comes down heads with probability $1/2^k$, flip a fair
coin, one that comes down heads with probability 1/2, $k$ times and report heads if the fair
coin comes down heads in all $k$ flips.


Given $k$, on average it will take $2^k$ ones before $k$ is incremented. Thus, the expected
number of 1's to produce the current value of $k$ is $1 + 2 + 4 + \cdots + 2^{k-1} = 2^k - 1$.
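
A minimal Python sketch of this approximate counter follows. The function name is ours, and for brevity the biased coin is drawn with a single call to the random number generator rather than by $k$ fair coin flips as described above; the two are equivalent in distribution.

```python
import random

def approximate_count(bits, rng):
    """Approximate the number of 1's in a 0/1 sequence while storing only k.

    Each 1 increments k with probability 1 / 2**k; the estimate is 2**k - 1.
    """
    k = 0
    for bit in bits:
        if bit == 1 and rng.random() < 0.5 ** k:
            k += 1
    return 2 ** k - 1
```

A single run can be off by a large factor, since the estimate is always one less than a power of two. Averaging the estimates from many independent runs on the same input should come out close to the true count.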





6.2.3 Frequent Elements 


The Majority and Frequent Algorithms First consider the very simple problem of
$n$ people voting. There are $m$ candidates, $\{1,2,\ldots,m\}$. We want to determine if one
candidate gets a majority vote and if so who. Formally, we are given a stream of integers
$a_1, a_2, \ldots, a_n$, each $a_i$ belonging to $\{1,2,\ldots,m\}$, and want to determine whether there is
some $s \in \{1,2,\ldots,m\}$ which occurs more than $n/2$ times and if so which $s$. It is easy to
see that solving the problem exactly on read-once streaming data with a deterministic
algorithm requires $\Omega(\min(n,m))$ space. Suppose $n$ is even and the last $n/2$ items are
identical. Suppose also that after reading the first $n/2$ items, there are two different sets
of elements that result in the same content of our memory. In that case, a mistake would
occur if the second half of the stream consists solely of an element that is in one set, but
not in the other. If $n/2 \ge m$ then there are at least $2^m - 1$ possible subsets of the first
$n/2$ elements. If $n/2 \le m$ then there are $\sum_{i=1}^{n/2} \binom{m}{i}$ subsets. By the above argument, the
number of bits of memory must be at least the base 2 logarithm of the number of subsets,
which is $\Omega(\min(m,n))$.


Surprisingly, we can bypass the above lower bound by slightly weakening our goal. 
Again let's require that if some element appears more than n/2 times, then we must 
output it. But now, let us say that if no element appears more than n/2 times, then our 
algorithm may output whatever it wants, rather than requiring that it output “no”. That 
is, there may be “false positives”, but no “false negatives”. 


Majority Algorithm 

Store $a_1$ and initialize a counter to one. For each subsequent $a_i$, if $a_i$ is the
same as the currently stored item, increment the counter by one. If it differs,
decrement the counter by one provided the counter is non-zero. If the counter
is zero, then store $a_i$ and set the counter to one.


To analyze the algorithm, it is convenient to view the decrement counter step as “elim- 
inating” two items, the new one and the one that caused the last increment in the counter. 
It is easy to see that if there is a majority element s, it must be stored at the end. If 
not, each occurrence of s was eliminated; but each such elimination also causes another 
item to be eliminated. Thus for a majority item not to be stored at the end, more than
$n$ items must have been eliminated, a contradiction.
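
As a sketch, the Majority Algorithm can be written in Python as below. The function name is ours. The returned item is only a candidate: if the stream has a majority element it is the one returned, but if there is none the output is arbitrary, which is exactly the “false positive” allowed above.

```python
def majority_candidate(stream):
    """Return the item stored at the end of the Majority Algorithm.

    Uses one stored item and one counter, regardless of stream length.
    """
    stored, count = None, 0
    for item in stream:
        if count == 0:
            stored, count = item, 1     # counter is zero: store the new item
        elif item == stored:
            count += 1                  # same as the stored item
        else:
            count -= 1                  # differs: "eliminate" a pair of items
    return stored
```

On the stream 1, 2, 1, 3, 1 the function returns 1, which occurs three times out of five. On 1, 2, 3, which has no majority element, it returns 3.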


Next we modify the above algorithm so that not just the majority, but also items
with frequency above some threshold are detected. More specifically, the algorithm below
finds the frequency (number of occurrences) of each element of $\{1,2,\ldots,m\}$ to within an
additive term of $\frac{n}{k+1}$. That is, for each symbol $s$, the algorithm produces a value $\tilde{f}_s$ in
$[f_s - \frac{n}{k+1}, f_s]$, where $f_s$ is the true number of occurrences of symbol $s$ in the sequence.
It will do so using $O(k \log n + k \log m)$ space by keeping $k$ counters instead of just one
counter.


Algorithm Frequent 


Maintain a list of items being counted. Initially the list is empty. For each 
item, if it is the same as some item on the list, increment its counter by one. 
If it differs from all the items on the list, then if there are less than k items 
on the list, add the item to the list with its counter set to one. If there are 
already k items on the list, decrement each of the current counters by one. 
Delete an element from the list if its count becomes zero. 


Theorem 6.2 At the end of Algorithm Frequent, for each $s \in \{1,2,\ldots,m\}$, its counter
on the list $\tilde{f}_s$ satisfies $\tilde{f}_s \in [f_s - \frac{n}{k+1}, f_s]$. If some $s$ does not occur on the list, its counter
is zero and the theorem asserts that $f_s \le \frac{n}{k+1}$.


Proof: The fact that $\tilde{f}_s \le f_s$ is immediate. To show $\tilde{f}_s \ge f_s - \frac{n}{k+1}$, view each decrement
counter step as eliminating some items. An item is eliminated if the current $a_i$ being read
is not on the list and there are already $k$ symbols different from it on the list; in this case, $a_i$
and $k$ other distinct symbols are simultaneously eliminated. Thus, the elimination of each
occurrence of an $s \in \{1,2,\ldots,m\}$ is really the elimination of $k + 1$ items corresponding
to distinct symbols. Thus, no more than $n/(k + 1)$ occurrences of any symbol can be
eliminated. It is clear that if an item is not eliminated, then it must still be on the list at
the end. This proves the theorem. ∎

Theorem 6.2 implies that we can compute the true frequency of every $s \in \{1,2,\ldots,m\}$
to within an additive term of $\frac{n}{k+1}$.
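
Algorithm Frequent translates almost line by line into Python. The sketch below uses a dictionary for the list of counted items; the function name is ours.

```python
def frequent(stream, k):
    """Run Algorithm Frequent with at most k counters.

    Returns a dict mapping each item still on the list to its counter.
    Items not in the dict have an implicit counter of zero.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1             # already on the list
        elif len(counters) < k:
            counters[item] = 1              # room on the list: start counting
        else:
            # List is full: decrement every counter, dropping those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

For the stream 1, 1, 2, 3, 1, 4, 1, 2, 5, 1 with $k = 2$, the final list is $\{1: 3, 5: 1\}$. Here $n = 10$ and $\frac{n}{k+1} \approx 3.33$; symbol 1 truly occurs 5 times, so its counter of 3 lies in the interval promised by Theorem 6.2.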


6.2.4 The Second Moment


This section focuses on computing the second moment of a stream with symbols from
$\{1,2,\ldots,m\}$. Let $f_s$ denote the number of occurrences of the symbol $s$ in the stream,
and recall that the second moment of the stream is given by $\sum_{s=1}^{m} f_s^2$. To calculate the
second moment, for each symbol $s$, $1 \le s \le m$, independently set a random variable $x_s$
to $\pm 1$ with probability 1/2. In particular, think of $x_s$ as the output of a random hash
function $h(s)$ whose range is just the two buckets $\{-1, 1\}$. For now, think of $h$ as a fully
independent hash function. Maintain a sum by adding $x_s$ to the sum each time the symbol
$s$ occurs in the stream. At the end of the stream, the sum will equal $\sum_{s=1}^{m} x_s f_s$. The
expected value of the sum will be zero where the expectation is over the choice of the $\pm 1$
value for the $x_s$.

$$E\left(\sum_{s=1}^{m} x_s f_s\right) = 0$$


Although the expected value of the sum is zero, its actual value is a random variable and
the expected value of the square of the sum is given by

$$E\left(\sum_{s=1}^{m} x_s f_s\right)^2 = E\left(\sum_{s=1}^{m} x_s^2 f_s^2\right) + 2E\left(\sum_{s<t} x_s x_t f_s f_t\right) = \sum_{s=1}^{m} f_s^2.$$

The last equality follows since $E(x_s x_t) = E(x_s)E(x_t) = 0$ for $s \ne t$, using pairwise
independence of the random variables. Thus

$$a = \left(\sum_{s=1}^{m} x_s f_s\right)^2$$

is an unbiased estimator of $\sum_{s=1}^{m} f_s^2$ in that it has the correct expectation. Note that at
this point we could use Markov’s inequality to state that $\text{Prob}\left(a \ge 3\sum_{s=1}^{m} f_s^2\right) \le 1/3$, but
we want to get a tighter guarantee. To do so, consider the second moment of $a$:


$$E(a^2) = E\left(\sum_{s=1}^{m} x_s f_s\right)^4 = E\left(\sum_{1 \le s,t,u,v \le m} x_s x_t x_u x_v f_s f_t f_u f_v\right).$$

The last equality is by expansion. Assume that the random variables $x_s$ are 4-wise
independent, or equivalently that they are produced by a 4-wise independent hash function.
Then, since the $x_s$ are independent in the last sum, if any one of $s$, $u$, $t$, or $v$ is distinct
from the others, then the expectation of the term is zero. Thus, we need to deal only
with terms of the form $x_s^2 x_t^2$ for $t \ne s$ and terms of the form $x_s^4$.


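Each term in the above sum has four indices, $s,t,u,v$, and there are $\binom{4}{2}$ ways of
choosing two indices that have the same $x$ value. Thus,

$$\begin{aligned}
E(a^2) &\le \binom{4}{2} E\left(\sum_{s=1}^{m} \sum_{t=s+1}^{m} x_s^2 x_t^2 f_s^2 f_t^2\right) + E\left(\sum_{s=1}^{m} x_s^4 f_s^4\right) \\
&= 6 \sum_{s=1}^{m} \sum_{t=s+1}^{m} f_s^2 f_t^2 + \sum_{s=1}^{m} f_s^4 \\
&\le 3\left(\sum_{s=1}^{m} f_s^2\right)^2 = 3E^2(a).
\end{aligned}$$

Therefore, $\text{Var}(a) = E(a^2) - E^2(a) \le 2E^2(a)$.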


Since the variance is comparable to the square of the expectation, repeating the process 
several times and taking the average, gives high accuracy with high probability. 


Theorem 6.3 The average $x$ of $r = \frac{2}{\varepsilon^2 \delta}$ estimates $a_1, \ldots, a_r$ using independent sets of
4-way independent random variables satisfies

$$\text{Prob}\left(|x - E(x)| > \varepsilon E(x)\right) \le \frac{\text{Var}(x)}{\varepsilon^2 E^2(x)} \le \delta.$$

Proof: The proof follows from the fact that taking the average of $r$ independent
repetitions reduces variance by a factor of $r$, so that $\text{Var}(x) \le \delta \varepsilon^2 E^2(x)$, and then applying
Chebyshev’s inequality. ∎


It remains to show that we can implement the desired 4-way independent random vari- 
ables using O(log m) space. We earlier gave a construction for a pairwise-independent set 
of hash functions; now we need 4-wise independence, though only into a range of {—1, 1}. 
Below we present one such construction. 


Error-Correcting codes, polynomial interpolation and limited-way independence
Consider the problem of generating a random $m$-dimensional vector $\mathbf{x}$ of $\pm 1$’s so
that any four coordinates are mutually independent. Such an $m$-dimensional vector may
be generated from a truly random “seed” of only $O(\log m)$ mutually independent bits.
Thus, we need only store the $O(\log m)$ bits and can generate any of the $m$ coordinates
when needed. For any $k$, there is a finite field $F$ with exactly $2^k$ elements, each of which
can be represented with $k$ bits and arithmetic operations in the field can be carried out in
$O(k^2)$ time. Here, $k$ is the ceiling of $\log_2 m$. A basic fact about polynomial interpolation
is that a polynomial of degree at most three is uniquely determined by its value over
any field $F$ at four points. More precisely, for any four distinct points $a_1, a_2, a_3, a_4$ in $F$
and any four possibly not distinct values $b_1, b_2, b_3, b_4$ in $F$, there is a unique polynomial
$f(x) = f_0 + f_1 x + f_2 x^2 + f_3 x^3$ of degree at most three, so that with computations done
over $F$, $f(a_i) = b_i$, $1 \le i \le 4$.

The definition of the pseudo-random $\pm 1$ vector $\mathbf{x}$ with 4-way independence is simple.
Choose four elements $f_0, f_1, f_2, f_3$ at random from $F$ and form the polynomial $f(s) =
f_0 + f_1 s + f_2 s^2 + f_3 s^3$. This polynomial represents $\mathbf{x}$ as follows. For $s = 1,2,\ldots,m$, $x_s$
is the leading bit of the $k$-bit representation of $f(s)$.$^{28}$ Thus, the $m$-dimensional vector $\mathbf{x}$
requires only $O(k)$ bits where $k = \lceil \log m \rceil$.





Lemma 6.4 The x defined above has 4-way independence. 


Proof: Assume that the elements of $F$ are represented in binary using $\pm 1$ instead of the
traditional 0 and 1. Let $s$, $t$, $u$, and $v$ be any four coordinates of $\mathbf{x}$ and let $\alpha$, $\beta$, $\gamma$, and
$\delta$ have values in $\pm 1$. There are exactly $2^{k-1}$ elements of $F$ whose leading bit is $\alpha$ and
similarly for $\beta$, $\gamma$, and $\delta$. So, there are exactly $2^{4(k-1)}$ 4-tuples of elements $b_1$, $b_2$, $b_3$, and
$b_4$ in $F$ so that the leading bit of $b_1$ is $\alpha$, the leading bit of $b_2$ is $\beta$, the leading bit of $b_3$
is $\gamma$, and the leading bit of $b_4$ is $\delta$. For each such $b_1$, $b_2$, $b_3$, and $b_4$, there is precisely one
polynomial $f$ so that $f(s) = b_1$, $f(t) = b_2$, $f(u) = b_3$, and $f(v) = b_4$. The probability
that $x_s = \alpha$, $x_t = \beta$, $x_u = \gamma$, and $x_v = \delta$ is precisely

$$\frac{2^{4(k-1)}}{\text{total number of } f} = \frac{2^{4(k-1)}}{2^{4k}} = \frac{1}{16}.$$

Four way independence follows since $\text{Prob}(x_s = \alpha) = \text{Prob}(x_t = \beta) = \text{Prob}(x_u = \gamma) =
\text{Prob}(x_v = \delta) = 1/2$ and thus

$$\text{Prob}(x_s = \alpha)\,\text{Prob}(x_t = \beta)\,\text{Prob}(x_u = \gamma)\,\text{Prob}(x_v = \delta) = \text{Prob}(x_s = \alpha,\ x_t = \beta,\ x_u = \gamma \text{ and } x_v = \delta).$$ ∎


Lemma 6.4 describes how to get one vector $\mathbf{x}$ with 4-way independence. However, we
need $r = O(1/\varepsilon^2)$ mutually independent vectors. Choose $r$ independent polynomials at
the outset.


To implement the algorithm with low space, store only the polynomials in memory.
This requires $4k = O(\log m)$ bits per polynomial for a total of $O\left(\frac{\log m}{\varepsilon^2}\right)$ bits. When a
symbol $s$ in the stream is read, compute each polynomial at $s$ to obtain the
corresponding value of $x_s$ and update the running sums. $x_s$ is just the leading bit of
the value of the polynomial evaluated at $s$. This calculation requires $O(\log m)$ time. Thus,
we repeatedly compute the $x_s$ from the “seeds”, namely the coefficients of the polynomials.
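
The following Python sketch puts the pieces together for a small alphabet. It is an illustration under explicit assumptions, not the only possible realization: we fix $k = 8$, so symbols must lie in $\{1,\ldots,255\}$, and we represent the field with $2^8$ elements using the irreducible polynomial $x^8 + x^4 + x^3 + x + 1$ (our choice; any irreducible polynomial of degree 8 would do). All function names are ours.

```python
import random

K = 8                  # field with 2**8 elements; symbols must be in 1..255
IRREDUCIBLE = 0x11B    # x^8 + x^4 + x^3 + x + 1 (assumed choice of modulus)

def gf_mul(a, b):
    """Multiply two field elements (carry-less product reduced mod IRREDUCIBLE)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & (1 << K):
            a ^= IRREDUCIBLE
        b >>= 1
    return result

def sign(coeffs, s):
    """x_s in {+1, -1}: the leading bit of f(s), with f given by [f0, f1, f2, f3]."""
    value = 0
    for c in reversed(coeffs):          # Horner's rule; addition in the field is XOR
        value = gf_mul(value, s) ^ c
    return 1 if (value >> (K - 1)) & 1 else -1

def second_moment_estimate(stream, r, rng):
    """Average of r estimates (sum_s x_s f_s)**2, one per random cubic polynomial."""
    polys = [[rng.randrange(1 << K) for _ in range(4)] for _ in range(r)]
    sums = [0] * r
    for s in stream:                    # one pass; only the r running sums are updated
        for i, coeffs in enumerate(polys):
            sums[i] += sign(coeffs, s)
    return sum(t * t for t in sums) / r
```

Only the $4r$ coefficients and the $r$ running sums are stored; each $x_s$ is recomputed from its polynomial whenever the symbol $s$ appears. For a stream in which the symbols 1, 2, 3, 7 occur 5, 3, 2, 4 times, the true second moment is 54, and the average of a few hundred estimates should be close to that value.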


This idea of polynomial interpolation is also used in other contexts. Error-correcting
codes are an important example. To transmit $n$ bits over a channel which may introduce
noise, one can introduce redundancy into the transmission so that some channel errors
can be corrected. A simple way to do this is to view the $n$ bits to be transmitted as
coefficients of a polynomial $f(x)$ of degree $n - 1$. Now transmit $f$ evaluated at points
$1, 2, 3, \ldots, n + m$. At the receiving end, any $n$ correct values will suffice to reconstruct
the polynomial and the true message. So up to $m$ errors can be tolerated. But even if
the number of errors is at most $m$, it is not a simple matter to know which values are
corrupted. We do not elaborate on this here.


28 Here we have numbered the elements of the field $F$ as $s = 1, 2, \ldots, m$.


6.3 Matrix Algorithms using Sampling 


We now move from the streaming model to a model where the input is stored in 
memory, but because the input is so large, one would like to produce a much smaller 
approximation to it, or perform an approximate computation on it in low space. For 
instance, the input might be stored in a large slow memory and we would like a small 
“sketch” that can be stored in smaller fast memory and yet retains the important prop- 
erties of the original input. In fact, one can view a number of results from the chapter on 
machine learning in this way: we have a large population, and we want to take a small 
sample, perform some optimization on the sample, and then argue that the optimum 
solution on the sample will be approximately optimal over the whole population. In the 
chapter on machine learning, our sample consisted of independent random draws from 
the overall population or data distribution. Here we will be looking at matrix algorithms 
and to achieve errors that are small compared to the Frobenius norm of the matrix rather 
than compared to the total number of entries, we will perform non-uniform sampling. 


Algorithms for matrix problems like matrix multiplication, low-rank approximations,
singular value decomposition, compressed representations of matrices, linear regression
etc. are widely used but some require $O(n^3)$ time for $n \times n$ matrices.


The natural alternative to working on the whole input matrix is to pick a random 
sub-matrix and compute with that. Here, we will pick a subset of columns or rows of the 
input matrix. If the sample size s is the number of columns we are willing to work with, 
we will do s independent identical trials. In each trial, we select a column of the matrix. 
All that we have to decide is what the probability of picking each column is. Sampling 
uniformly at random is one option, but it is not always good if we want our error to be 
a small fraction of the Frobenius norm of the matrix. For example, suppose the input 
matrix has all entries in the range $[-1, 1]$ but most columns are close to the zero vector
with only a few significant columns. Then, uniformly sampling a small number of columns
is unlikely to pick up any of the significant columns and essentially will approximate the
original matrix with the all-zeroes matrix.$^{29}$





29 There are, on the other hand, many positive statements one can make about uniform sampling.
For example, suppose the columns of $A$ are data points in an $m$-dimensional space (one dimension per
row). Fix any $k$-dimensional subspace, such as the subspace spanned by the $k$ top singular vectors. If
we randomly sample $O(k/\varepsilon^2)$ columns uniformly, by the VC-dimension bounds given in Chapter 6, with
high probability for every vector $v$ in the $k$-dimensional space and every threshold $\tau$, the fraction of the
sampled columns $a$ that satisfy $v^T a \ge \tau$ will be within $\pm\varepsilon$ of the fraction of the columns $a$ in the overall
matrix $A$ satisfying $v^T a \ge \tau$.
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We will see that the “optimal” probabilities are proportional to the squared length of 
columns. This is referred to as length squared sampling and since its first discovery in the 
mid-90’s, has been proved to have several desirable properties which we will see. Note 
that all sampling we will discuss here is done with replacement. 


Two general notes on this approach: 

(i) We will prove error bounds which hold for all input matrices. Our algorithms 
are randomized, i.e., use a random number generator, so the error bounds are random 
variables. The bounds are on the expected error or tail probability bounds on large errors 
and apply to any matrix. Note that this contrasts with the situation where we have a 
stochastic model of the input matrix and only assert error bounds for “most” matrices 
drawn from the probability distribution of the stochastic model. A mnemonic is: our
algorithms can toss coins, but our data does not toss coins. A reason for proving error
bounds for any matrix is that in real problems, like the analysis of the web hypertext link 
matrix or the patient-genome expression matrix, it is the one matrix the user is interested 
in, not a random matrix. In general, we focus on general algorithms and theorems, not 
specific applications, so the reader need not be aware of what the two matrices above 
mean. 

(ii) There is “no free lunch”. Since we only work on a small random sample and not 
on the whole input matrix, our error bounds will not be good for certain matrices. For 
example, if the input matrix is the identity, it is intuitively clear that picking a few ran- 
dom columns will miss the other directions. 


To the Reader: Why aren’t (i) and (ii) mutually contradictory? 


6.3.1 Matrix Multiplication using Sampling 


Suppose A is an $m \times n$ matrix and B is an $n \times p$ matrix and the product AB is desired.
We show how to use sampling to get an approximate product faster than the traditional
multiplication. Let A(:,k) denote the $k^{th}$ column of A; A(:,k) is an $m \times 1$ matrix. Let
B(k,:) be the $k^{th}$ row of B; B(k,:) is a $1 \times p$ matrix. It is easy to see that

$$AB = \sum_{k=1}^{n} A(:,k)\,B(k,:).$$


Note that for each value of k, A(:,k)B(k,:) is an m x p matrix each element of which is a 
single product of elements of A and B. An obvious use of sampling suggests itself. Sample 
some values for k and compute A (:, k) B (k,:) for the sampled k’s and use their suitably 
scaled sum as the estimate of AB. It turns out that non-uniform sampling probabilities 
are useful. Define a random variable z that takes on values in {1,2,...,n}. Let $p_k$ denote
the probability that z assumes the value k. We will solve for a good choice of probabilities
later, but for now just consider the $p_k$ as non-negative numbers that sum to one. Define
an associated random matrix variable that has value

$$X = \frac{1}{p_k}\,A(:,k)\,B(k,:) \qquad (6.1)$$

with probability $p_k$. Let E(X) denote the entry-wise expectation.


$$E(X) = \sum_{k=1}^{n} \text{Prob}(z = k)\,\frac{1}{p_k}\,A(:,k)\,B(k,:) = \sum_{k=1}^{n} A(:,k)\,B(k,:) = AB.$$


This explains the scaling by $1/p_k$ in X. In particular, X is a matrix-valued random variable
each of whose components is correct in expectation. We will be interested in 


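$$E\left(\|AB - X\|_F^2\right).$$

This can be viewed as the variance of X, defined as the sum of the variances of all its
entries.

$$\text{Var}(X) = \sum_{i=1}^{m}\sum_{j=1}^{p} \text{Var}(x_{ij}) = \sum_{ij}\left(E(x_{ij}^2) - E(x_{ij})^2\right) = \left(\sum_{ij}\sum_{k} p_k\,\frac{1}{p_k^2}\,a_{ik}^2 b_{kj}^2\right) - \|AB\|_F^2.$$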


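We want to choose $p_k$ to minimize this quantity, and notice that we can ignore the $\|AB\|_F^2$
term since it doesn't depend on the $p_k$'s at all. We can now simplify by exchanging the
order of summations to get

$$\sum_{ij}\sum_{k} \frac{1}{p_k}\,a_{ik}^2 b_{kj}^2 = \sum_{k} \frac{1}{p_k}\left(\sum_{i} a_{ik}^2\right)\left(\sum_{j} b_{kj}^2\right) = \sum_{k} \frac{1}{p_k}\,|A(:,k)|^2\,|B(k,:)|^2.$$
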
What is the best choice of $p_k$ to minimize this sum? It can be seen by calculus (footnote 30) that the
minimizing $p_k$ are proportional to |A(:,k)||B(k,:)|. In the important special case when
$B = A^T$, pick columns of A with probabilities proportional to the squared length of the
columns. Even in the general case when B is not $A^T$, doing so simplifies the bounds.
This sampling is called “length squared sampling”. If $p_k$ is proportional to $|A(:,k)|^2$, i.e.,
$p_k = \frac{|A(:,k)|^2}{\|A\|_F^2}$, then

$$E\left(\|AB - X\|_F^2\right) = \text{Var}(X) \le \|A\|_F^2 \sum_{k} |B(k,:)|^2 = \|A\|_F^2\,\|B\|_F^2.$$
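The calculus step behind the choice of probabilities can be written out as follows. This is a sketch using a Lagrange multiplier; the shorthand $c_k = |A(:,k)|^2 |B(k,:)|^2$ and the symbols $L$ and $\lambda$ are introduced here for illustration, and all $c_k$ are assumed positive.

```latex
% Minimize \sum_k c_k / p_k subject to \sum_k p_k = 1, p_k > 0,
% where c_k = |A(:,k)|^2 |B(k,:)|^2 (assumed positive).
\begin{align*}
L(p,\lambda) &= \sum_k \frac{c_k}{p_k} + \lambda\Big(\sum_k p_k - 1\Big), \\
\frac{\partial L}{\partial p_k} &= -\frac{c_k}{p_k^2} + \lambda = 0
  \;\Longrightarrow\; p_k = \sqrt{c_k/\lambda} \;\propto\; \sqrt{c_k}, \\
p_k &= \frac{\sqrt{c_k}}{\sum_j \sqrt{c_j}}
     = \frac{|A(:,k)|\,|B(k,:)|}{\sum_j |A(:,j)|\,|B(j,:)|}, \\
\sum_k \frac{c_k}{p_k} &= \Big(\sum_k \sqrt{c_k}\Big)^2 \quad\text{at the minimizer.}
\end{align*}
```

Since each term $c_k/p_k$ is convex in $p_k$, the stationary point is the minimum. With $B = A^T$ the formula reduces to $p_k$ proportional to $|A(:,k)|^2$.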


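To reduce the variance, we can do s independent trials. Each trial i, i = 1,2,...,s
yields a matrix $X_i$ as in (6.1). We take $\frac{1}{s}\sum_{i=1}^{s} X_i$ as our estimate of AB. Since the
variance of a sum of independent random variables is the sum of variances, the variance
of $\frac{1}{s}\sum_{i=1}^{s} X_i$ is $\frac{1}{s}\text{Var}(X)$ and so is at most $\frac{1}{s}\|A\|_F^2\|B\|_F^2$. Let $k_1, \ldots, k_s$ be the k's chosen
in each trial. Expanding this gives:

$$\frac{1}{s}\sum_{i=1}^{s} X_i = \frac{1}{s}\left(\frac{A(:,k_1)\,B(k_1,:)}{p_{k_1}} + \frac{A(:,k_2)\,B(k_2,:)}{p_{k_2}} + \cdots + \frac{A(:,k_s)\,B(k_s,:)}{p_{k_s}}\right). \qquad (6.2)$$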





30 By taking derivatives, for any set of non-negative numbers $c_k$, $\sum_k \frac{c_k}{p_k}$ is minimized with $p_k$ proportional to $\sqrt{c_k}$.

Figure 6.4: Approximate Matrix Multiplication using sampling


We will find it convenient to write this as the product of an $m \times s$ matrix with an $s \times p$
matrix as follows: Let C be the $m \times s$ matrix consisting of the following columns which
are scaled versions of the chosen columns of A:

$$\frac{A(:,k_1)}{\sqrt{s\,p_{k_1}}},\ \frac{A(:,k_2)}{\sqrt{s\,p_{k_2}}},\ \ldots,\ \frac{A(:,k_s)}{\sqrt{s\,p_{k_s}}}.$$


Note that the scaling has a nice property, which the reader can verify:

$$E\left(CC^T\right) = AA^T. \qquad (6.3)$$

Define R to be the $s \times p$ matrix with the corresponding rows of B similarly scaled, namely,
R has rows

$$\frac{B(k_1,:)}{\sqrt{s\,p_{k_1}}},\ \frac{B(k_2,:)}{\sqrt{s\,p_{k_2}}},\ \ldots,\ \frac{B(k_s,:)}{\sqrt{s\,p_{k_s}}}.$$

The reader may verify that

$$E\left(R^TR\right) = B^TB. \qquad (6.4)$$

From (6.2), we see that $\frac{1}{s}\sum_{i=1}^{s} X_i = CR$. This is represented in Figure 6.4. We summarize
our discussion in Theorem 6.5.


Theorem 6.5 Suppose A is an $m \times n$ matrix and B is an $n \times p$ matrix. The product
AB can be estimated by CR, where C is an $m \times s$ matrix consisting of s columns of A
picked according to length-squared distribution and scaled to satisfy (6.3) and R is the
$s \times p$ matrix consisting of the corresponding rows of B scaled to satisfy (6.4). The error
is bounded by:

$$E\left(\|AB - CR\|_F^2\right) \le \frac{\|A\|_F^2\,\|B\|_F^2}{s}.$$

Thus, to ensure $E(\|AB - CR\|_F^2) \le \varepsilon^2\|A\|_F^2\|B\|_F^2$, it suffices to make s greater than or
equal to $1/\varepsilon^2$. If $\varepsilon$ is $\Omega(1)$, so $s \in O(1)$, then the multiplication CR can be carried out in
time O(mp).
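The estimator CR of Theorem 6.5 can be sketched in a few lines. The following is a minimal illustration in plain Python, assuming matrices are stored as lists of rows and that A is not the zero matrix; the function names `approx_matmul`, `exact_matmul`, and `frob_sq` are hypothetical helpers introduced here, not part of the text.

```python
import random

def approx_matmul(A, B, s, rng):
    """Estimate the product AB from s length-squared samples, drawn with replacement."""
    m, n, p = len(A), len(B), len(B[0])
    # Squared length of each column of A gives the sampling probabilities p_k.
    col_sq = [sum(A[i][k] ** 2 for i in range(m)) for k in range(n)]
    total = sum(col_sq)                      # equals ||A||_F^2, assumed non-zero
    probs = [c / total for c in col_sq]
    picks = rng.choices(range(n), weights=probs, k=s)
    est = [[0.0] * p for _ in range(m)]
    for k in picks:
        scale = 1.0 / (s * probs[k])         # the 1/(s p_k) scaling that makes CR unbiased
        for i in range(m):
            for j in range(p):
                est[i][j] += scale * A[i][k] * B[k][j]
    return est

def exact_matmul(A, B):
    """Ordinary matrix product, used only to measure the error."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def frob_sq(M):
    """Squared Frobenius norm."""
    return sum(x * x for row in M for x in row)

if __name__ == "__main__":
    rng = random.Random(0)
    # A skewed example: a few heavy columns, the rest close to zero.
    A = [[rng.uniform(-1, 1) * (1.0 if k < 3 else 0.01) for k in range(50)] for _ in range(20)]
    B = [list(col) for col in zip(*A)]       # B = A^T
    s = 10
    est, exact = approx_matmul(A, B, s, rng), exact_matmul(A, B)
    err = frob_sq([[e - x for e, x in zip(re, rx)] for re, rx in zip(est, exact)])
    print("squared error:", err, " bound on its expectation:", frob_sq(A) * frob_sq(B) / s)
```

The bound printed at the end is a bound on the expected squared error, so a single run may land above or below it. Columns with zero length have probability zero and are never picked, which avoids dividing by a zero $p_k$.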




When is this error bound good and when is it not? Let's focus on the case that $B = A^T$
so we have just one matrix to consider. If A is the identity matrix, then the guarantee is
not very good. In this case, $\|AA^T\|_F^2 = n$, but the right-hand side of the inequality is $n^2/s$.
So we would need s > n for the bound to be any better than approximating the product
with the zero matrix.


More generally, the trivial estimate of the all zero matrix for $AA^T$ makes an error in
Frobenius norm of $\|AA^T\|_F$. What s do we need to ensure that the error is at most this?
If $\sigma_1, \sigma_2, \ldots$ are the singular values of A, then the singular values of $AA^T$ are $\sigma_1^2, \sigma_2^2, \ldots$
and

$$\|AA^T\|_F^2 = \sum_t \sigma_t^4 \quad\text{and}\quad \|A\|_F^2 = \sum_t \sigma_t^2.$$

So from Theorem 6.5, $E(\|AA^T - CR\|_F^2) \le \|AA^T\|_F^2$ provided

$$s \ge \frac{(\sigma_1^2 + \sigma_2^2 + \cdots)^2}{\sigma_1^4 + \sigma_2^4 + \cdots}.$$

If rank(A) = r, then there are r non-zero $\sigma_t$ and the best general upper bound on the
ratio $\frac{(\sigma_1^2 + \sigma_2^2 + \cdots)^2}{\sigma_1^4 + \sigma_2^4 + \cdots}$ is r, so in general, s needs to be at least r. If A is full rank, this means
sampling will not gain us anything over taking the whole matrix!


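However, if there is a constant c and a small integer p such that

$$\sigma_1^2 + \sigma_2^2 + \cdots + \sigma_p^2 \ge c\,(\sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2), \qquad (6.5)$$

then,

$$\frac{(\sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2)^2}{\sigma_1^4 + \sigma_2^4 + \cdots + \sigma_r^4} \le \frac{1}{c^2}\,\frac{(\sigma_1^2 + \sigma_2^2 + \cdots + \sigma_p^2)^2}{\sigma_1^4 + \sigma_2^4 + \cdots + \sigma_p^4} \le \frac{p}{c^2},$$

and so $s \ge p/c^2$ gives us a better estimate than the zero matrix. Increasing s by a factor
decreases the error by the same factor. Condition (6.5) is indeed the hypothesis of the
subject of Principal Component Analysis (PCA) and there are many situations when the
data matrix does satisfy the condition and sampling algorithms are useful.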
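To get a feel for the threshold on s, the ratio of the squared sum of $\sigma_t^2$ to the sum of $\sigma_t^4$ can be evaluated directly from a list of singular values. The sketch below assumes the singular values are already known; the function name `samples_needed` is hypothetical.

```python
def samples_needed(singular_values):
    """Ratio (sum sigma_t^2)^2 / (sum sigma_t^4).

    Once s reaches this value, the bound of Theorem 6.5 for CR is no worse
    than the error of the all-zero estimate of A A^T.
    """
    sq = [x * x for x in singular_values]
    return sum(sq) ** 2 / sum(v * v for v in sq)

if __name__ == "__main__":
    # Identity-like spectrum: every direction matters, so s must reach the rank.
    print(samples_needed([1.0] * 50))
    # One dominant singular value: a handful of samples already suffices.
    print(samples_needed([10.0, 1.0, 1.0, 1.0, 1.0]))
```

For the flat spectrum the ratio equals the rank, matching the identity-matrix example; for the skewed spectrum it is $104^2/10004$, only slightly above 1.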


6.3.2 Implementing Length Squared Sampling in Two Passes 


Traditional matrix algorithms often assume that the input matrix is in random access 
memory (RAM) and so any particular entry of the matrix can be accessed in unit time. 
For massive matrices, RAM may be too small to hold the entire matrix, but may be able 
to hold and compute with the sampled columns and rows. 


Consider a high-level model where the input matrix or matrices have to be read from
external memory using one pass in which one can read sequentially all entries of the
matrix and sample.

It is easy to see that two passes suffice to draw a sample of columns of A according
to length squared probabilities, even if the matrix is not in row-order or column-order 
and entries are presented as a linked list. In the first pass, compute the length squared of 
each column and store this information in RAM. The lengths squared can be computed as 
running sums. Then, use a random number generator in RAM to determine according to 
length squared probability the columns to be sampled. Then, make a second pass picking 
the columns to be sampled. 
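The two passes described above might look as follows. This is a minimal sketch, assuming the matrix arrives as (row, column, value) triples in arbitrary order and that the external memory can be re-read by calling a function that returns a fresh iterator; `stream_factory` and `two_pass_column_sample` are hypothetical names introduced for illustration.

```python
import random
from collections import defaultdict

def two_pass_column_sample(stream_factory, s, rng):
    """Draw s column indices by length-squared sampling, then fetch those columns.

    stream_factory() must return a fresh iterator over (i, j, a_ij) triples;
    each call models one sequential pass over external memory.
    """
    # Pass 1: running sums of the squared column lengths, kept in RAM.
    col_sq = defaultdict(float)
    for i, j, a in stream_factory():
        col_sq[j] += a * a
    cols = sorted(col_sq)
    # Choose the columns in RAM, with replacement, proportional to squared length.
    picks = rng.choices(cols, weights=[col_sq[j] for j in cols], k=s)
    # Pass 2: keep only the entries of the chosen columns.
    wanted = set(picks)
    kept = {j: {} for j in wanted}
    for i, j, a in stream_factory():
        if j in wanted:
            kept[j][i] = a
    return picks, kept

if __name__ == "__main__":
    entries = [(0, 0, 1.0), (1, 0, 2.0), (0, 1, 0.5), (1, 2, 3.0)]
    picks, kept = two_pass_column_sample(lambda: iter(entries), 4, random.Random(0))
    print(picks, kept)
```

RAM usage here is one number per column plus the sampled columns themselves, which is the point of the two-pass scheme.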


If the matrix is already presented in external memory in column-order, then one pass
will do. The idea is to use the primitive in Section 6.1: given a read-once stream of
positive numbers $a_1, a_2, \ldots, a_n$, at the end have an $i \in \{1,2,\ldots,n\}$ such that the
probability that i was chosen is $\frac{a_i}{\sum_{j=1}^{n} a_j}$. Filling in the specifics is left as an exercise for the reader.
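One common way to realize such a primitive is to keep a running total and replace the current choice by item i with probability $a_i/(a_1 + \cdots + a_i)$. The sketch below shows only the primitive for a single sample, under the assumption that all weights are positive; the name `weighted_pick` is hypothetical, and connecting it to columns (by feeding it squared column lengths as they complete) is left as in the text.

```python
import random

def weighted_pick(stream, rng):
    """Return a 1-based index i with probability a_i / sum_j a_j, reading the stream once."""
    total = 0.0
    chosen = None
    for i, a in enumerate(stream, start=1):
        total += a
        # Switch to item i with probability a_i / (a_1 + ... + a_i).
        # For i = 1 this probability is 1, so the first item is always taken.
        if rng.random() < a / total:
            chosen = i
    return chosen

if __name__ == "__main__":
    rng = random.Random(0)
    draws = [weighted_pick([1.0, 3.0], rng) for _ in range(20000)]
    print("fraction of draws equal to 2:", draws.count(2) / len(draws))
```

Item i survives to the end with probability $\frac{a_i}{S_i}\prod_{j>i}\frac{S_{j-1}}{S_j} = \frac{a_i}{S_n}$, where $S_j = a_1 + \cdots + a_j$, so the product telescopes to the desired distribution while using only constant memory.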





6.3.3 Sketch of a Large Matrix 


The main result of this section is that for any matrix, a sample of columns and rows,
each picked according to length squared distribution provides a good sketch of the matrix.
Let A be an $m \times n$ matrix. Pick s columns of A according to length squared distribution.
Let C be the $m \times s$ matrix containing the picked columns scaled so as to satisfy (6.3), i.e.,
if A(:,k) is picked, it is scaled by $1/\sqrt{s\,p_k}$. Similarly, pick r rows of A according to length
squared distribution on the rows of A. Let R be the $r \times n$ matrix of the picked rows, scaled
as follows: If row k of A is picked, it is scaled by $1/\sqrt{r\,p_k}$. We then have $E(R^TR) = A^TA$.
From C and R, one can find a matrix U so that $A \approx CUR$. The schematic diagram is
given in Figure 6.5.


One may recall that the top k singular vectors of the SVD of A give a similar picture; 
however, the SVD takes more time to compute, requires all of A to be stored in RAM, 
and does not have the property that the rows and columns are directly from A. The last 
property, that the approximation involves actual rows/columns of the matrix rather than 
linear combinations, is called an interpolative approximation and is useful in many con- 
texts. On the other hand, the SVD yields the best 2-norm approximation. Error bounds 
for the approximation CUR are weaker. 


We briefly touch upon two motivations for such a sketch. Suppose A is the document-
term matrix of a large collection of documents. We are to “read” the collection at the
outset and store a sketch so that later, when a query represented by a vector with one
entry per term arrives, we can find its similarity to each document in the collection.
Similarity is defined by the dot product. In Figure 6.5 it is clear that the matrix-vector
product of a query with the right hand side can be done in time O(ns + sr + rm) which
would be linear in n and m if s and r are O(1). To bound errors for this process, we
need to show that the difference between A and the sketch of A has small 2-norm. Recall
that the 2-norm $\|A\|_2$ of a matrix A is $\max_{|x|=1} |Ax|$. The fact that the sketch is an
interpolative approximation means that our approximation essentially consists of a subset
of documents and a subset of terms, which may be thought of as a representative set of
documents and terms. Additionally, if A is sparse in its rows and columns, that is, each document
contains only a small fraction of the terms and each term is in only a small fraction of
the documents, then this sparsity property will be preserved in C and R, unlike with SVD.

Figure 6.5: Schematic diagram of the approximation of A by a sample of s columns and
r rows.


A second motivation comes from analyzing gene microarray data. Here, A is a matrix 
in which each row is a gene and each column is a condition. Entry (i,j) indicates the
extent to which gene i is expressed in condition j. In this context, a CUR decomposition
provides a sketch of the matrix A in which rows and columns correspond to actual genes 
and conditions, respectively. This can often be easier for biologists to interpret than a 
singular value decomposition in which rows and columns would be linear combinations of 
the genes and conditions. 


It remains now to describe how to find U from C and R. There is an $n \times n$ matrix P
of the form P = QR that acts as the identity on the space spanned by the rows of R and
zeros out all vectors orthogonal to this space. We state this now and postpone the proof.

Lemma 6.6 If $RR^T$ is invertible, then $P = R^T(RR^T)^{-1}R$ has the following properties:

(i) It acts as the identity matrix on the row space of R. I.e., Px = x for every vector x
of the form $x = R^Ty$ (this defines the row space of R). Furthermore,

(ii) if x is orthogonal to the row space of R, then Px = 0.

If $RR^T$ is not invertible, let rank$(RR^T) = r$ and $RR^T = \sum_{t=1}^{r} \sigma_t u_t v_t^T$ be the SVD of
$RR^T$. Then,

$$P = R^T\left(\sum_{t=1}^{r} \frac{1}{\sigma_t}\,u_t v_t^T\right) R$$

satisfies (i) and (ii).

We begin with some intuition. In particular, we first present a simpler idea that does
not work, but that motivates an idea that does. Write A as AI, where I is the $n \times n$
identity matrix. Approximate the product AI using the algorithm of Theorem 6.5, i.e.,
by sampling s columns of A according to a length-squared distribution. Then, as in the
last section, write $AI \approx CW$, where W consists of a scaled version of the s rows of I
corresponding to the s columns of A that were picked. Theorem 6.5 bounds the error
$\|A - CW\|_F^2$ by $\|A\|_F^2\|I\|_F^2/s = \frac{n}{s}\|A\|_F^2$. But we would like the error to be a small fraction
of $\|A\|_F^2$ which would require $s \ge n$, which clearly is of no use since this would pick as
many or more columns than the whole of A.


Let’s use the identity-like matrix P instead of I in the above discussion. Using the
fact that R is picked according to length squared sampling, we will show the following
proposition later.


Proposition 6.7 $A \approx AP$ and the error $E(\|A - AP\|_2^2)$ is at most $\frac{1}{\sqrt{r}}\|A\|_F^2$.


We then use Theorem 6.5 to argue that instead of doing the multiplication AP, we can
use the sampled columns of A and the corresponding rows of P. The s sampled columns
of A form C. We have to take the corresponding s rows of $P = R^T(RR^T)^{-1}R$, which is
the same as taking the corresponding s rows of $R^T$, and multiplying this by $(RR^T)^{-1}R$. It
is easy to check that this leads to an expression of the form CUR. Further, by Theorem
6.5, the error is bounded by

$$E\left(\|AP - CUR\|_2^2\right) \le E\left(\|AP - CUR\|_F^2\right) \le \frac{\|A\|_F^2\,\|P\|_F^2}{s} \le \frac{r}{s}\,\|A\|_F^2, \qquad (6.6)$$

since we will show later that:

Proposition 6.8 $\|P\|_F^2 \le r$.


Putting (6.6) and Proposition 6.7 together, and using the fact that by triangle inequality
$\|A - CUR\|_2 \le \|A - AP\|_2 + \|AP - CUR\|_2$, which in turn implies that $\|A - CUR\|_2^2 \le
2\|A - AP\|_2^2 + 2\|AP - CUR\|_2^2$, the main result below follows.


Theorem 6.9 Let A be an $m \times n$ matrix and r and s be positive integers. Let C be an
$m \times s$ matrix of s columns of A picked according to length squared sampling and let R be
a matrix of r rows of A picked according to length squared sampling. Then, we can find
from C and R an $s \times r$ matrix U so that

$$E\left(\|A - CUR\|_2^2\right) \le \|A\|_F^2\left(\frac{2}{\sqrt{r}} + \frac{2r}{s}\right).$$

If s is fixed, the error is minimized when $r = \Theta(s^{2/3})$. Choosing $s = 1/\varepsilon^3$ and $r = 1/\varepsilon^2$,
the bound becomes $O(\varepsilon)\|A\|_F^2$. When is this bound meaningful? We discuss this further
after first proving all the claims used in the discussion above.
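The choice of r relative to s can be checked by a short calculation. The function name $f$ below is introduced only for this sketch; it is the factor multiplying $\|A\|_F^2$ in Theorem 6.9.

```latex
\begin{align*}
f(r) &= \frac{2}{\sqrt{r}} + \frac{2r}{s}, \qquad
f'(r) = -r^{-3/2} + \frac{2}{s} = 0
  \;\Longrightarrow\; r = (s/2)^{2/3}, \\
s &= \varepsilon^{-3},\ r = \varepsilon^{-2}: \qquad
\frac{2}{\sqrt{r}} + \frac{2r}{s} = 2\varepsilon + 2\varepsilon = 4\varepsilon .
\end{align*}
```

So the exact minimizer differs from $s^{2/3}$ only by a constant factor, and the stated parameter choice balances the two terms.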




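Proof of Lemma 6.6: First consider the case that $RR^T$ is invertible. For $x = R^Ty$,
$R^T(RR^T)^{-1}Rx = R^T(RR^T)^{-1}RR^Ty = R^Ty = x$. If x is orthogonal to every row of R,
then Rx = 0, so Px = 0. More generally, if $RR^T = \sum_t \sigma_t u_t v_t^T$, then
$R^T\left(\sum_t \frac{1}{\sigma_t} u_t v_t^T\right)R = \sum_t w_t w_t^T$, where the $w_t$ are the right singular vectors of R
with non-zero singular value, and this clearly satisfies (i) and (ii). ∎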
Next we prove Proposition 6.7. First, recall that

$$\|A - AP\|_2^2 = \max_{|x|=1} |(A - AP)x|^2.$$

Now suppose x is in the row space V of R. From Lemma 6.6, Px = x, so for $x \in V$,
$(A - AP)x = 0$. Since every vector can be written as a sum of a vector in V plus a vector
orthogonal to V, this implies that the maximum must therefore occur at some $x \in V^\perp$.
For such x, by Lemma 6.6, $(A - AP)x = Ax$. Thus, the question becomes: for unit-length
$x \in V^\perp$, how large can $|Ax|^2$ be? To analyze this, write (using Rx = 0 for $x \in V^\perp$):

$$|Ax|^2 = x^TA^TAx = x^T(A^TA - R^TR)x \le \|A^TA - R^TR\|_2\,|x|^2 \le \|A^TA - R^TR\|_2.$$

This implies that $\|A - AP\|_2^2 \le \|A^TA - R^TR\|_2$. So, it suffices to prove that
$E(\|A^TA - R^TR\|_2^2) \le \|A\|_F^4/r$, since then $E(\|A^TA - R^TR\|_2) \le \|A\|_F^2/\sqrt{r}$. This follows
directly from Theorem 6.5 (the 2-norm is at most the Frobenius norm), since we can think of $R^TR$
as a way of estimating $A^TA$ by picking according to length-squared distribution columns
of $A^T$, i.e., rows of A. This proves Proposition 6.7.


Proposition 6.8 is easy to see. By Lemma 6.6, P is the identity on the space V spanned 
by the rows of R, and Px = 0 for x perpendicular to the rows of R. Thus $\|P\|_F^2$ is the
sum of its singular values squared which is at most r as claimed.


We now briefly look at the time needed to compute U. The only involved step in
computing U is to find $(RR^T)^{-1}$ or do the SVD of $RR^T$. But note that $RR^T$ is an $r \times r$
matrix and since r is much smaller than n and m, this is fast.
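For concreteness, the matrix U implied by the discussion above can be written out explicitly. This is a sketch for the case that $RR^T$ is invertible; the names W and D are notation introduced here, with $k_1, \ldots, k_s$ the sampled column indices of A and $p_k = |A(:,k)|^2/\|A\|_F^2$.

```latex
% Row k of P, and the scaled rows of P that Theorem 6.5 pairs with the columns in C.
\begin{align*}
P(k,:) &= R(:,k)^T\,(RR^T)^{-1} R, \\
W &= \begin{pmatrix} P(k_1,:)/\sqrt{s\,p_{k_1}} \\ \vdots \\ P(k_s,:)/\sqrt{s\,p_{k_s}} \end{pmatrix}
   = D\,(RR^T)^{-1} R,
\qquad
D = \begin{pmatrix} R(:,k_1)^T/\sqrt{s\,p_{k_1}} \\ \vdots \\ R(:,k_s)^T/\sqrt{s\,p_{k_s}} \end{pmatrix}, \\
AP &\approx CW = C\,\underbrace{D\,(RR^T)^{-1}}_{U}\,R .
\end{align*}
```

Here D is $s \times r$ and $(RR^T)^{-1}$ is $r \times r$, so U is $s \times r$ as stated in Theorem 6.9, and forming it needs only the sampled entries of R plus one small inverse.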


Understanding the bound in Theorem 6.9: To better understand the bound in
Theorem 6.9 consider when it is meaningful and when it is not. First, choose parameters
$s = \Theta(1/\varepsilon^3)$ and $r = \Theta(1/\varepsilon^2)$ so that the bound becomes $E(\|A - CUR\|_2^2) \le \varepsilon\|A\|_F^2$.
Recall that $\|A\|_F^2 = \sum_i \sigma_i^2(A)$, i.e., the sum of squares of all the singular values of A.
Also, for convenience scale A so that $\sigma_1^2(A) = 1$. Then

$$\sigma_1^2(A) = \|A\|_2^2 = 1 \quad\text{and}\quad E\left(\|A - CUR\|_2^2\right) \le \varepsilon \sum_i \sigma_i^2(A).$$

This gives an intuitive sense of when the guarantee is good and when it is not. If the
top k singular values of A are all $\Omega(1)$ for $k \gg m^{1/3}$, so that $\sum_i \sigma_i^2(A) \gg m^{1/3}$, then
the guarantee is only meaningful when $\varepsilon = o(m^{-1/3})$, which is not interesting because it
requires s > m. On the other hand, if just the first few singular values of A are large
and the rest are quite small, e.g., A represents a collection of points that lie very close
to a low-dimensional subspace, and in particular if $\sum_i \sigma_i^2(A)$ is a constant, then to be
meaningful the bound requires $\varepsilon$ to be a small constant. In this case, the guarantee is
indeed meaningful because it implies that a constant number of rows and columns provides
a good 2-norm approximation to A.

Figure 6.6: Samples of overlapping sets A and B.


6.4 Sketches of Documents 


Suppose one wished to store all the web pages from the World Wide Web. Since there 
are billions of web pages, one might want to store just a sketch of each page where a sketch 
is some type of compact description that captures sufficient information to do whatever 
task one has in mind. For the current discussion, we will think of a web page as a string 
of characters, and the task at hand will be one of estimating similarities between pairs of 
web pages. 


We begin this section by showing how to estimate similarities between sets via sam- 
pling, and then how to convert the problem of estimating similarities between strings into 
a problem of estimating similarities between sets. 


Consider subsets of size 1000 of the integers from 1 to $10^6$. Suppose one wished to
compute the resemblance of two subsets A and B by the formula

$$\text{resemblance}(A, B) = \frac{|A \cap B|}{|A \cup B|}.$$


Suppose that instead of using the sets A and B, one sampled the sets and compared
random subsets of size ten. How accurate would the estimate be? One way to sample
would be to select ten elements uniformly at random from A and B. Suppose A and B
were each of size 1000, overlapped by 500, and both were represented by six samples.
Even though half of the six samples of A were in B they would not likely be among the
samples representing B. See Figure 6.6. This method is unlikely to produce overlapping
samples. Another way would be to select the ten smallest elements from each of A and
B. If the sets A and B overlapped significantly one might expect the sets of ten smallest
elements from each of A and B to also overlap. One difficulty that might arise is that the
small integers might be used for some special purpose and appear in essentially all sets
and thus distort the results. To overcome this potential problem, rename all elements
using a random permutation.

Suppose two subsets of size 1000 overlapped by 900 elements. What might one expect
the overlap of the 10 smallest elements from each subset to be? One would expect the
nine smallest elements from the 900 common elements to be in each of the two sampled
subsets for an overlap of 90%. The expected resemblance(A, B) for the size ten sample
would be 9/11 ≈ 0.82.
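The "ten smallest elements after a random renaming" idea can be tried out directly. The following is a small sketch under stated assumptions: the universe is shrunk to 10,000 integers to keep the permutation cheap, and the helper names `make_ranking`, `smallest_sketch`, and `resemblance` are hypothetical.

```python
import random

def make_ranking(universe_size, rng):
    """A random permutation of 1..universe_size, stored as element -> new name."""
    names = list(range(universe_size))
    rng.shuffle(names)
    return {x: names[x - 1] for x in range(1, universe_size + 1)}

def smallest_sketch(items, rank, s):
    """The s elements of items that come first after renaming by rank."""
    return set(sorted(items, key=lambda x: rank[x])[:s])

def resemblance(X, Y):
    """|X intersect Y| / |X union Y|."""
    return len(X & Y) / len(X | Y)

if __name__ == "__main__":
    rng = random.Random(0)
    rank = make_ranking(10_000, rng)
    A = set(range(1, 1001))        # 1..1000
    B = set(range(101, 1101))      # 101..1100, so the overlap is 900 elements
    print("true resemblance:     ", resemblance(A, B))
    print("estimate from sketches:", resemblance(smallest_sketch(A, rank, 10),
                                                 smallest_sketch(B, rank, 10)))
```

Both sets must be renamed by the same permutation; otherwise the two sketches are unrelated and the overlap is lost, just as with independent uniform samples. With only ten elements per sketch the estimate is coarse, so individual runs can differ noticeably from the true value.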


Another method would be to select the elements equal to zero mod m for some inte- 
ger m. If one samples mod m the size of the sample becomes a function of n. Sampling 
mod m allows one to handle containment. 
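A sketch of mod m sampling, again after a random renaming, shows why it copes with containment: if A is a subset of B, every sampled element of A is also a sampled element of B. Containment is taken here as the fraction of A that lies in B, matching the definition given later in this section; the helper names `mod_sketch` and `containment` and the small universe are assumptions made for illustration.

```python
import random

def mod_sketch(items, rank, m):
    """Elements of items whose renamed value is zero mod m."""
    return {x for x in items if rank[x] % m == 0}

def containment(X, Y):
    """Fraction of X that also lies in Y (0.0 for an empty X)."""
    return len(X & Y) / len(X) if X else 0.0

if __name__ == "__main__":
    rng = random.Random(0)
    n_universe = 3000
    names = list(range(n_universe))
    rng.shuffle(names)
    rank = dict(zip(range(1, n_universe + 1), names))   # shared random renaming

    A = set(range(1, 201))       # A is contained in B
    B = set(range(1, 2001))
    VA, VB = mod_sketch(A, rank, 10), mod_sketch(B, rank, 10)
    print("sketch sizes:", len(VA), len(VB))            # grow with the set sizes
    print("estimated containment of A in B:", containment(VA, VB))
```

The sketch size is roughly the set size divided by m, so unlike the fixed-size "s smallest" sample it grows with the set, as the text notes.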


In another version of the problem one has a string of characters rather than a set. 
Here one converts the string into a set by replacing it by the set of all of its substrings 
of some small length k. Corresponding to each string is a set of length k substrings. If 
k is modestly large, then two strings are highly unlikely to give rise to the same set of 
substrings. Thus, we have converted the problem of sampling a string to that of sampling 
a set. Instead of storing all the substrings of length k, we need only store a small subset 
of the length k substrings. 
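Converting a string to its set of length k substrings takes one line. The sketch below works on characters and reuses the set resemblance formula from earlier in the section; the function names `shingles` and `resemblance` are hypothetical.

```python
def shingles(text, k):
    """Set of all length-k substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def resemblance(X, Y):
    """|X intersect Y| / |X union Y| (taken as 1.0 when both sets are empty)."""
    union = X | Y
    return len(X & Y) / len(union) if union else 1.0

if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog"
    b = "the quick brown fox leaps over the lazy dog"
    for k in (2, 4, 8):
        print(k, resemblance(shingles(a, k), shingles(b, k)))
```

To work with words instead of characters, as in the web page example that follows in the text, one could pass a tuple of words and collect tuples of k consecutive words in the same way.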


Suppose you wish to be able to determine if two web pages are minor modifications 
of one another or to determine if one is a fragment of the other. Extract the sequence of 
words occurring on the page, viewing each word as a character. Then define the set of 
substrings of k consecutive words from the sequence. Let S(D) be the set of all substrings 
of k consecutive words occurring in document D. Define resemblance of A and B by 


$$\text{resemblance}(A, B) = \frac{|S(A) \cap S(B)|}{|S(A) \cup S(B)|}.$$

And define containment as

$$\text{containment}(A, B) = \frac{|S(A) \cap S(B)|}{|S(A)|}.$$

Let $\pi$ be a random permutation of all length k substrings. Define F(A) to be the s
smallest elements of A and V(A) to be the set mod m in the ordering defined by the
permutation.

Then

$$\frac{|F(A) \cap F(B)|}{|F(A) \cup F(B)|}$$

and

$$\frac{|V(A) \cap V(B)|}{|V(A) \cup V(B)|}$$

are unbiased estimates of the resemblance of A and B. The value

$$\frac{|V(A) \cap V(B)|}{|V(A)|}$$

is an unbiased estimate of the containment of A in B.

The hashing-based algorithm for counting the number of distinct elements in a data

further research in the area. Improvements and generalizations of Algorithm Frequent

Length-squared sampling was introduced by Frieze, Kannan and Vempala [FKV04];
the algorithms of Section 6.3 are from [DKM06a, DKM06b]. The material in Section 6.4
on sketches of documents is from Broder et al. [BGMZ97].

6.6 Exercises
Exercise 6.1 Let $a_1, a_2, \ldots, a_n$ be a stream of symbols each an integer in {1,...,m}.


1. Give an algorithm that will select a symbol uniformly at random from the stream. 
How much memory does your algorithm require? 


2. Give an algorithm that will select a symbol with probability proportional to $a^2$.


Exercise 6.2 How would one pick a random word from a very large book where the prob- 
ability of picking a word is proportional to the number of occurrences of the word in the 
book? 


Exercise 6.3 Consider a matrix where each element has a probability of being selected.
Can you select a row according to the sum of probabilities of elements in that row by just 
selecting an element according to its probability and selecting the row that the element is 
in? 


Exercise 6.4 For the streaming model give an algorithm to draw t independent samples
of indices i, each with probability proportional to the value of $a_i$. Some indices may be
drawn multiple times. What is its memory usage?


Exercise 6.5 Randomly generate 1,000 integers in the range $[1, 10^6]$ and calculate $10^6/\min$
100 times. What is the probability that the estimate for the number of distinct elements
is in the range [167, 6,000]? How would you improve the probability?


Exercise 6.6 For some constant c > 0, it is possible to create $2^{cm}$ subsets of {1,...,m},
each with m/2 elements, such that no two of the subsets share more than 3m/8 elements
in common (see footnote 31). Use this fact to argue that any deterministic algorithm that even guarantees
to approximate the number of distinct elements in a data stream with error less than $m/16$
must use $\Omega(m)$ bits of memory on some input sequence of length $n \le 2m$.


Exercise 6.7 Consider an algorithm that uses a random hash function and gives an
estimate $\hat{x}$ of the true value x of some variable. Suppose that $\frac{x}{4} \le \hat{x} \le 4x$ with probability
at least 0.6. The probability of the estimate is with respect to choice of the hash function.
How would you improve the probability that $\frac{x}{4} \le \hat{x} \le 4x$ to 0.8? Hint: Since we do not
know the variance, taking the average may not help and we need to use some other function
of multiple runs.


Exercise 6.8 Give an example of a set H of hash functions such that h(x) is equally 
likely to be any element of {0,..., M — 1} (H is 1-universal) but H is not 2-universal. 


Exercise 6.9 Let p be a prime. A set of hash functions

$$H = \{h \mid \{0,1,\ldots,p-1\} \to \{0,1,\ldots,p-1\}\}$$

is 3-universal if for all x, y, z, u, v, w in {0,1,...,p-1}, where x, y, z are distinct we have

$$\text{Prob}\big(h(x) = u,\ h(y) = v,\ h(z) = w\big) = \frac{1}{p^3}.$$

(a) Is the set $\{h_{ab}(x) = ax + b \bmod p \mid 0 \le a, b < p\}$ of hash functions 3-universal?

(b) Give a 3-universal set of hash functions.

31 For example, choosing them randomly will work with high probability. You expect two subsets of size
m/2 to share m/4 elements in common, and with high probability they will share no more than 3m/8.


Exercise 6.10 Select a value for k and create a set

$$H = \{x \mid x = (x_1, x_2, \ldots, x_k),\ x_i \in \{0,1,\ldots,k-1\}\}$$

where the set of vectors H is pairwise independent and $|H| < k^k$. We say that a set of vectors
is pairwise independent if for any subset of two of the coordinates, all of the $k^2$ possible
pairs of values that could appear in those coordinates such as (0,0), (0,1), ..., (1,0), (1,1), ...
occur the exact same number of times.


Exercise 6.11 Consider a coin that comes down heads with probability p. Prove that the 
expected number of flips needed to see a heads is 1/p. 


Exercise 6.12 Randomly generate a string $x_1x_2 \cdots x_n$ of $10^6$ 0’s and 1's with probability
1/2 of $x_i$ being a 1. Count the number of ones in the string and also estimate the number of
ones by the coin-flip approximate counting algorithm in Section 6.2.2. Repeat the process
for p = 1/4, 1/8, and 1/16. How close is the approximation?


Counting Frequent Elements 
The Majority and Frequent Algorithms 
The Second Moment 


Exercise 6.13 


1. Construct an example in which the majority algorithm gives a false positive, i.e., 
stores a non-majority element at the end. 


2. Construct an example in which the frequent algorithm in fact does as badly as in the 
theorem, i.e., it under counts some item by n/(k+1). 


Exercise 6.14 Let p be a prime and n > 2 be an integer. What representation do you 
use to do arithmetic in the finite field with $p^n$ elements? How do you do addition? How
do you do multiplication? 


Error-Correcting codes, polynomial interpolation and limited-way independence

Exercise 6.15 Let F be a field. Prove that for any four distinct points $a_1, a_2, a_3$, and $a_4$
in F and any four possibly not distinct values $b_1, b_2, b_3$, and $b_4$ in F, there is a unique
polynomial $f(x) = f_0 + f_1x + f_2x^2 + f_3x^3$ of degree at most three so that $f(a_i) = b_i$,
$1 \le i \le 4$, with all computations done over F. If you use the Vandermonde matrix you
can use the fact that the matrix is non-singular.





Sketch of a Large Matrix 


Exercise 6.16 Suppose we want to pick a row of a matrix at random where the probability
of picking row i is proportional to the sum of squares of the entries of that row. How would
we do this in the streaming model?


(a) Do the problem when the matrix is given in column order.


(b) Do the problem when the matrix is represented in sparse notation: it is just presented
as a list of triples $(i, j, a_{ij})$, in arbitrary order.


Matrix Multiplication Using Sampling 


Exercise 6.17 Suppose A and B are two matrices. Prove that $AB = \sum_{k=1}^{n} A(:,k)\,B(k,:)$.

Exercise 6.18 Generate two 100 by 100 matrices A and B with integer values between
1 and 100. Compute the product AB both directly and by sampling. Plot the difference
in $L_2$ norm between the results as a function of the number of samples. In generating
the matrices make sure that they are skewed. One method would be the following. First
generate two 100 dimensional vectors a and b with integer values between 1 and 100. Next
generate the $i^{th}$ row of A with integer values between 1 and $a_i$ and the $i^{th}$ column of B
with integer values between 1 and $b_i$.


Approximating a Matrix with a Sample of Rows and Columns 

Exercise 6.19 Suppose $a_1, a_2, \ldots, a_m$ are non-negative reals. Show that the minimum
of $\sum_{k=1}^{m} \frac{a_k}{x_k}$ subject to the constraints $x_k \ge 0$ and $\sum_k x_k = 1$ is attained when the $x_k$ are
proportional to $\sqrt{a_k}$.


Sketches of Documents 


Exercise 6.20 Construct two different strings of 0’s and 1’s having the same set of sub- 
strings of length k = 3. 


Exercise 6.21 (Random strings, empirical analysis). Consider random strings of length
n composed of the integers 0 through 9, where we represent a string x by its set S(x)
of length k-substrings. Perform the following experiment: choose two random strings x
and y of length n = 10,000 and compute their resemblance $\frac{|S(x) \cap S(y)|}{|S(x) \cup S(y)|}$ for k = 1, 2, 3, ....
What does the graph of resemblance as a function of k look like?

Exercise 6.22 (Random strings, theoretical analysis). Consider random strings of length
n composed of the integers 0 through 9, where we represent a string x by its set S(x) of
length k-substrings. Consider now drawing two random strings x and y of length n and
computing their resemblance $\frac{|S(x) \cap S(y)|}{|S(x) \cup S(y)|}$.

1. Prove that for $k \le \frac{1}{2}\log_{10}(n)$, with high probability as n goes to infinity the two
strings have resemblance equal to 1.


2. Prove that for $k \ge 3\log_{10}(n)$, with high probability as n goes to infinity the two
strings have resemblance equal to 0.


Exercise 6.23 Discuss how you might go about detecting plagiarism in term papers. 


Exercise 6.24 Suppose you had one billion web pages and you wished to remove dupli- 
cates. How might you do this? 


Exercise 6.25 Consider the following lyrics: 


When you walk through the storm hold your head up high and don’t be afraid 
of the dark. At the end of the storm there’s a golden sky and the sweet silver 
song of the lark. 

Walk on, through the wind, walk on through the rain though your dreams be 
tossed and blown. Walk on, walk on, with hope in your heart and you'll never 
walk alone, you'll never walk alone. 


How large must k be to uniquely recover the lyric from the set of all length k subsequences 
of symbols? Treat the blank as a symbol. 


Exercise 6.26 Blast: Given a long string A, say of length 10° and a shorter string B, 
say 10°, how do we find a position in A which is the start of a substring B' that is close 
to B? This problem can be solved by dynamic programming in polynomial time, but find 
a faster algorithm to solve this problem. 

Hint: (Shingling approach) One possible approach would be to fix a small length, say
seven, and consider the shingles of A and B of length seven. If a close approximation to
B is a substring of A, then a number of shingles of B must be shingles of A. This should
allow us to find the approximate location in A of the approximation of B. Some final
algorithm should then be able to find the best match.


7 Clustering


7.1 Introduction 


Clustering refers to partitioning a set of objects into subsets according to some de- 
sired criterion. Often it is an important step in making sense of large amounts of data. 
Clustering comes up in many contexts. One might want to partition a set of news articles 
into clusters based on the topics of the articles. Given a set of pictures of people, one 
might want to group them into clusters based on who is in the image. Or one might want 
to cluster a set of protein sequences according to the protein function. A related problem 
is not finding a full partitioning but rather just identifying natural clusters that exist. 
For example, given a collection of friendship relations among people, one might want to 
identify any tight-knit groups that exist. In some cases we have a well-defined correct 
answer, e.g., in clustering photographs of individuals by who is in them, but in other cases 
the notion of a good clustering may be more subjective. 


Before running a clustering algorithm, one first needs to choose an appropriate representation for the data. One common representation is as vectors in $\mathbf{R}^d$. This corresponds
to identifying d real-valued features that are then computed for each data object. For ex- 
ample, to represent documents one might use a “bag of words” representation, where each 
feature corresponds to a word in the English language and the value of the feature is how 
many times that word appears in the document. Another common representation is as 
vertices in a graph, with edges weighted by some measure of how similar or dissimilar the 
two endpoints are. For example, given a set of protein sequences, one might weight edges 
based on an edit-distance measure that essentially computes the cost of transforming one 
sequence into the other. This measure is typically symmetric and satisfies the triangle 
inequality, and so can be thought of as a finite metric. A point worth noting up front 
is that often the “correct” clustering of a given set of data depends on your goals. For 
instance, given a set of photographs of individuals, we might want to cluster the images by 
who is in them, or we might want to cluster them by facial expression. When representing 
the images as points in space or as nodes in a weighted graph, it is important that the 
features we use be relevant to the criterion we care about. In any event, the issue of how 
best to represent data to highlight the relevant information for a given task is generally 
addressed using knowledge of the specific domain. From our perspective, the job of the 
clustering algorithm begins after the data has been represented in some appropriate way. 


In this chapter, our goals are to discuss (a) some commonly used clustering algorithms 
and what one can prove about them, and (b) models and assumptions on data under which 
we can find a clustering close to the correct clustering. 


7.1.1 Preliminaries 


We will follow the standard notation of using n to denote the number of data points 
and k to denote the number of desired clusters. We will primarily focus on the case that 
k is known up front, but will also discuss algorithms that produce a sequence of solutions,
one for each value of k, as well as algorithms that produce a cluster tree that can encode
multiple clusterings at each value of k. We will generally use $A = \{a_1, \ldots, a_n\}$ to denote
the n data points. We also think of A as a matrix with rows $a_1, \ldots, a_n$.


7.1.2 Two General Assumptions on the Form of Clusters 


Before choosing a clustering algorithm, it is useful to have some general idea of what 
a good clustering should look like. In general, there are two types of assumptions often 
made that in turn lead to different classes of clustering algorithms. 


Center-based clusters: One assumption commonly made is that clusters are center-based. This means that the clustering can be defined by k centers $c_1, \ldots, c_k$, with each
data point assigned to whichever center is closest to it. Note that this assumption does 
not yet tell whether one choice of centers is better than another. For this, one needs 
an objective, or optimization criterion. Three standard criteria often used are k-center, 
k-median, and k-means clustering, defined as follows. 


k-center clustering: Find a partition $\mathcal{C} = \{C_1, \ldots, C_k\}$ of A into k clusters, with corresponding centers $c_1, \ldots, c_k$, to minimize the maximum distance between any data
point and the center of its cluster. That is, we want to minimize
$$\Phi_{kcenter}(\mathcal{C}) = \max_{j=1}^{k} \max_{a_i \in C_j} d(a_i, c_j).$$
k-center clustering makes sense when we believe clusters should be local regions in 
space. It is also often thought of as the “firehouse location problem” since one can 
think of it as the problem of locating k fire-stations in a city so as to minimize the 
maximum distance a fire-truck might need to travel to put out a fire. 


k-median clustering: Find a partition $\mathcal{C} = \{C_1, \ldots, C_k\}$ of A into k clusters, with corresponding centers $c_1, \ldots, c_k$, to minimize the sum of distances between data points
and the centers of their clusters. That is, we want to minimize
$$\Phi_{kmedian}(\mathcal{C}) = \sum_{j=1}^{k} \sum_{a_i \in C_j} d(a_i, c_j).$$


k-median clustering is more noise-tolerant than k-center clustering because we are 
taking a sum rather than a max. A small number of outliers will typically not 
change the optimal solution by much, unless they are very far away or there are 
several quite different near-optimal solutions. 


k-means clustering: Find a partition $\mathcal{C} = \{C_1, \ldots, C_k\}$ of A into k clusters, with corresponding centers $c_1, \ldots, c_k$, to minimize the sum of squares of distances between
data points and the centers of their clusters. That is, we want to minimize
$$\Phi_{kmeans}(\mathcal{C}) = \sum_{j=1}^{k} \sum_{a_i \in C_j} d^2(a_i, c_j).$$


k-means clustering puts more weight on outliers than k-median clustering, because 
we are squaring the distances, which magnifies large values. This puts it somewhat 
in between k-median and k-center clustering in that regard. Using distance squared 
has some mathematical advantages over using pure distances when data are points in
$\mathbf{R}^d$. For example, Corollary 7.2 asserts that with the distance squared criterion,
the optimal center for a given group of data points is its centroid.
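
The three criteria differ only in how the point-to-center distances are aggregated: a max, a sum, or a sum of squares. The following is a minimal sketch, assuming Euclidean distance and a clustering given explicitly as lists of points with one center per list; the function names `dist` and `objectives` are illustrative and not from the text.

```python
import math

def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def objectives(clusters, centers):
    """Return the (k-center, k-median, k-means) costs of a clustering.

    clusters: list of k lists of points; centers: list of k points,
    where centers[j] is the center of clusters[j].
    """
    all_d = [dist(a, c) for pts, c in zip(clusters, centers) for a in pts]
    return max(all_d), sum(all_d), sum(d * d for d in all_d)

# Toy data (assumed for illustration): two clusters in the plane.
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 6.0)]]
centers = [(0.0, 1.0), (10.0, 3.0)]
k_center, k_median, k_means = objectives(clusters, centers)
print(k_center, k_median, k_means)
```

Here the four distances are 1, 1, 3, 3, so the far points in the second cluster dominate the k-center cost entirely and contribute most of the k-means cost.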


The k-means criterion is more often used when data consists of points in $\mathbf{R}^d$, whereas
k-median is more commonly used when we have a finite metric, that is, data are nodes in 
a graph with distances on edges. 


When data are points in $\mathbf{R}^d$, there are in general two variations of the clustering problem for each of the criteria. We could require that each cluster center be a data point or
allow a cluster center to be any point in space. If we require each center to be a data
point, the optimal clustering of n data points into k clusters can be found in time $\binom{n}{k}$
times a polynomial in the length of the data. First, exhaustively enumerate all sets of k
data points as the possible sets of k cluster centers, then associate each point to its nearest
center and select the best clustering. No such naive enumeration procedure is available
when cluster centers can be any point in space. But, for the k-means problem, Corollary 7.2 shows that once we have identified the data points that belong to a cluster, the
best choice of cluster center is the centroid of that cluster, which might not be a data point.
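
The enumeration procedure for the version where centers must be data points can be sketched as follows. This is an illustrative sketch under the k-means criterion; the name `best_discrete_kmeans` and the toy data are assumptions, and the loop visits all k-subsets, so it is only practical for small n and k.

```python
from itertools import combinations

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def best_discrete_kmeans(points, k):
    """Try every k-subset of the data as centers; return (cost, centers)."""
    best = None
    for centers in combinations(points, k):
        # Each point is charged the squared distance to its nearest center.
        cost = sum(min(sq_dist(a, c) for c in centers) for a in points)
        if best is None or cost < best[0]:
            best = (cost, centers)
    return best

points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
cost, centers = best_discrete_kmeans(points, 2)
print(cost, centers)
```

Swapping the inner `sum` for a `max`, or `sq_dist` for a plain distance, gives the same enumeration for the k-center and k-median criteria.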


When k is part of the input or may be a function of n, the above optimization problems are all NP-hard.32 So, guarantees on algorithms will typically involve either some
form of approximation or some additional assumptions, or both.


High-density clusters: If we do not believe our desired clusters will be center-based, 
an alternative assumption often made is that clusters consist of high-density regions sur- 
rounded by low-density “moats” between them. For example, in the clustering of Figure 
7.1 we have one natural cluster A that looks center-based but the other cluster B consists 
of a ring around cluster A. As seen in the figure, this assumption does not require clus- 
ters to correspond to convex regions and it can allow them to be long and stringy. We 
describe a non-center-based clustering method in Section 7.7. In Section 7.9 we prove the 
effectiveness of an algorithm which finds a “moat”, cuts up data “inside” the moat and 
“outside” into two pieces and recursively applies the same procedure to each piece. 





32 If k is a constant, then as noted above, the version where the centers must be data points can be
solved in polynomial time.

Figure 7.1: Example where the natural clustering is not center-based.


7.1.3 Spectral Clustering 


An important part of a clustering toolkit when data lies in $\mathbf{R}^d$ is Singular Value Decomposition. Spectral Clustering refers to the following algorithm: Find the space V
spanned by the top k right singular vectors of the matrix A whose rows are the data 
points. Project data points to V and cluster in the projection. 


An obvious reason to do this is dimension reduction: clustering in the d-dimensional
space where data lies is reduced to clustering in a k-dimensional space (usually, $k \ll d$).
A more important point is that under certain assumptions one can prove that spectral
clustering gives a clustering close to the true clustering. We already saw this in the case
when data is from a mixture of spherical Gaussians, Section 3.9.3. The assumption used is
“the means separated by a constant number of standard deviations”. In Section 7.5, we
will see that in a much more general setting, which includes common stochastic models,
the same assumption, in spirit, yields similar conclusions. Section 7.4 has another setting
with a similar result.


7.2 k-Means Clustering 


We assume in this section that data points lie in $\mathbf{R}^d$ and focus on the k-means criterion.


7.2.1 A Maximum-Likelihood Motivation 


We now consider a maximum-likelihood motivation for using the k-means criterion.
Suppose that the data was generated according to an equal weight mixture of k spherical
well-separated Gaussian densities centered at $\mu_1, \mu_2, \ldots, \mu_k$, each with variance one in
every direction. Then the density of the mixture is

$$\text{Prob}(x) = \frac{1}{(2\pi)^{d/2}} \frac{1}{k} \sum_{i=1}^{k} e^{-|x - \mu_i|^2}.$$


Denote by $\mu(x)$ the center nearest to x. Since the exponential function falls off fast,
assuming x is noticeably closer to its nearest center than to any other center, we can
approximate $\sum_{i=1}^{k} e^{-|x - \mu_i|^2}$ by $e^{-|x - \mu(x)|^2}$ since the sum is dominated by its largest term.
Thus

$$\text{Prob}(x) \approx \frac{1}{(2\pi)^{d/2} k} e^{-|x - \mu(x)|^2}.$$

The likelihood of drawing the sample of points $x_1, x_2, \ldots, x_n$ from the mixture, if the
centers were $\mu_1, \mu_2, \ldots, \mu_k$, is approximately

$$\frac{1}{k^n} \frac{1}{(2\pi)^{nd/2}} \prod_{i=1}^{n} e^{-|x_i - \mu(x_i)|^2} = c\, e^{-\sum_{i=1}^{n} |x_i - \mu(x_i)|^2}.$$

Minimizing the sum of squared distances to cluster centers finds the maximum likelihood
$\mu_1, \mu_2, \ldots, \mu_k$. This motivates using the sum of distance squared to the cluster centers.
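
To see how tight the "largest term dominates" approximation is, one can compare the negative logarithm of the full sum of exponentials with the squared distance to the nearest center. The sketch below uses made-up centers and sample points (assumptions for illustration) that are well separated, as the argument requires.

```python
import math

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def exact_neg_log(x, mus):
    """-log of sum_i exp(-|x - mu_i|^2), the x-dependent part of the density."""
    return -math.log(sum(math.exp(-sq_dist(x, m)) for m in mus))

def approx_neg_log(x, mus):
    """Keep only the largest term: the squared distance to the nearest center."""
    return min(sq_dist(x, m) for m in mus)

mus = [(0.0, 0.0), (6.0, 0.0)]
sample = [(0.5, 0.2), (-0.3, 0.1), (5.8, -0.4), (6.2, 0.3)]
exact = sum(exact_neg_log(x, mus) for x in sample)
approx = sum(approx_neg_log(x, mus) for x in sample)  # this is the k-means cost
print(exact, approx)
```

With centers this far apart the two totals agree to many decimal places, so maximizing the approximate likelihood and minimizing the k-means cost pick out essentially the same centers.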


7.2.2 Structural Properties of the k-Means Objective 


Suppose we have already determined the clustering or the partitioning into $C_1, C_2, \ldots, C_k$.
What are the best centers for the clusters? The following lemma shows that the answer 
is the centroids, the coordinate means, of the clusters. 


Lemma 7.1 Let $\{a_1, a_2, \ldots, a_n\}$ be a set of points. The sum of the squared distances of
the $a_i$ to any point x equals the sum of the squared distances to the centroid of the $a_i$ plus
n times the squared distance from x to the centroid. That is,

$$\sum_i |a_i - x|^2 = \sum_i |a_i - c|^2 + n\,|c - x|^2$$

where $c = \frac{1}{n} \sum_{i=1}^{n} a_i$ is the centroid of the set of points.


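Proof:

$$\sum_i |a_i - x|^2 = \sum_i |a_i - c + c - x|^2 = \sum_i |a_i - c|^2 + 2(c - x) \cdot \sum_i (a_i - c) + n\,|c - x|^2.$$

Since c is the centroid, $\sum_i (a_i - c) = 0$. Thus, $\sum_i |a_i - x|^2 = \sum_i |a_i - c|^2 + n\,|c - x|^2$. ∎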


A corollary of Lemma 7.1 is that the centroid minimizes the sum of squared distances
since the first term, $\sum_i |a_i - c|^2$, is a constant independent of x and setting x = c sets the
second term, $n\,|c - x|^2$, to zero.


Corollary 7.2 Let $\{a_1, a_2, \ldots, a_n\}$ be a set of points. The sum of squared distances of
the $a_i$ to a point x is minimized when x is the centroid, namely $x = \frac{1}{n} \sum_i a_i$.
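
The identity in Lemma 7.1 is easy to check numerically. The following sketch uses a small made-up point set and an arbitrary point x (both assumptions for illustration) and evaluates the two sides separately.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((u - v) ** 2 for u, v in zip(p, q))

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

points = [(1.0, 2.0), (3.0, 0.0), (5.0, 4.0), (-1.0, 2.0)]
x = (4.0, -3.0)
c = centroid(points)

# Left side: sum of squared distances to x.
lhs = sum(sq_dist(a, x) for a in points)
# Right side: sum of squared distances to the centroid, plus n |c - x|^2.
rhs = sum(sq_dist(a, c) for a in points) + len(points) * sq_dist(c, x)
print(c, lhs, rhs)
```

For this data the centroid is (2, 2) and both sides equal 144; the second term, 116, is the penalty for measuring from x instead of from the centroid.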


7.2.3 Lloyd's Algorithm


Corollary 7.2 suggests the following natural strategy for k-means clustering, known as 
Lloyd’s algorithm. Lloyd’s algorithm does not necessarily find a globally optimal solution 
but will find a locally-optimal one. An important but unspecified step in the algorithm is 
its initialization: how the starting k centers are chosen. We discuss this after discussing 
the main algorithm. 


Lloyd’s algorithm:
Start with k centers.
Cluster each point with the center nearest to it.
Find the centroid of each cluster and replace the set of old centers with the centroids.
Repeat the above two steps until the centers converge according to some criterion, such
as the k-means score no longer improving.
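
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation: the stopping rule (assignments no longer change) is one possible convergence criterion, and letting an empty cluster keep its old center is a convention assumed here rather than something the text prescribes.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((u - v) ** 2 for u, v in zip(p, q))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def lloyd(points, centers, max_iter=100):
    """Run Lloyd's iterations from the given starting centers.

    Returns (centers, assignment), where assignment[i] is the index of the
    center that point i is clustered with.
    """
    centers = list(centers)
    assignment = None
    for _ in range(max_iter):
        # Cluster each point with the center nearest to it.
        new_assignment = [
            min(range(len(centers)), key=lambda j: sq_dist(a, centers[j]))
            for a in points
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the centers would not move
        assignment = new_assignment
        # Replace each center by the centroid of its cluster.
        for j in range(len(centers)):
            members = [a for a, lab in zip(points, assignment) if lab == j]
            if members:  # assumed convention: an empty cluster keeps its center
                centers[j] = centroid(members)
    return centers, assignment

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers, assignment = lloyd(points, [(0.0, 0.0), (10.0, 9.0)])
print(centers, assignment)
```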


This algorithm always converges to a local minimum of the objective. To show convergence, we argue that the sum of the squares of the distances of each point to its cluster
center never increases. Each iteration consists of two steps. First, consider the step
that finds the centroid of each cluster and replaces the old centers with the new centers.
By Corollary 7.2, this step cannot increase the sum of internal cluster distances squared. The
second step reclusters by assigning each point to its nearest cluster center, which also
cannot increase the internal cluster distances. Since there are only finitely many ways to
partition the data points, the process must eventually stop improving.


A problem that arises with some implementations of the k-means clustering algorithm 
is that one or more of the clusters becomes empty and there is no center from which to 
measure distance. A simple case where this occurs is illustrated in the following example. 
You might think about how you would modify the code to resolve this issue.


Example: Consider running the k-means clustering algorithm to find three clusters on 
the following 1-dimension data set: {2,3,7,8} starting with centers {0,5,10}. 




The center at 5 ends up with no items and there are only two clusters instead of the
desired three. ∎
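
The example can be traced by hand or with a short script. The sketch below records the clusters after each of the first two iterations; keeping the old center when a cluster is empty is an illustrative choice made only so the loop can continue, and `lloyd_trace_1d` is a hypothetical helper name.

```python
def lloyd_trace_1d(data, centers, steps):
    """Return the clusters (one list per center) after each Lloyd iteration."""
    centers = list(centers)
    history = []
    for _ in range(steps):
        clusters = [[] for _ in centers]
        for a in data:
            # Index of the center nearest to a.
            j = min(range(len(centers)), key=lambda t: abs(a - centers[t]))
            clusters[j].append(a)
        history.append(clusters)
        # An empty cluster has no mean, so it keeps its old center here.
        centers = [sum(c) / len(c) if c else old
                   for c, old in zip(clusters, centers)]
    return history

history = lloyd_trace_1d([2, 3, 7, 8], [0, 5, 10], steps=2)
for step, clusters in enumerate(history, start=1):
    print(step, clusters)
```

In the first iteration the clusters are {2}, {3, 7}, {8}, so the centers move to 2, 5, and 8. In the second iteration 3 is closer to 2 than to 5 and 7 is closer to 8 than to 5, which leaves the middle cluster empty.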

Figure 7.2: A locally-optimal but globally-suboptimal k-means clustering.


As noted above, Lloyd’s algorithm only finds a local optimum to the k-means objective
that might not be globally optimal. Consider, for example, Figure 7.2. Here data lies in
three dense clusters in $\mathbf{R}^2$: one centered at (0, 1), one centered at (0, -1) and one centered
at (3, 0). If we initialize with one center at (0, 1) and two centers near (3, 0), then the
center at (0, 1) will move to near (0, 0) and capture the points near (0, 1) and (0, -1),
whereas the centers near (3, 0) will just stay there, splitting that cluster.


Because the initial centers can substantially influence the quality of the result, there 
has been significant work on initialization strategies for Lloyd’s algorithm. One popular 
strategy is called “farthest traversal”. Here, we begin by choosing one data point as initial
center $c_1$ (say, randomly), then pick the farthest data point from $c_1$ to use as $c_2$, then
pick the farthest data point from $\{c_1, c_2\}$ to use as $c_3$, and so on. These are then used
as the initial centers. Notice that this will produce the correct solution in the example in 
Figure 7.2. 


Farthest traversal can unfortunately get fooled by a small number of outliers. To address this, a smoother, probabilistic variation known as k-means++ instead weights each data
point by its squared distance from the nearest previously chosen center. Then it selects
the next center probabilistically according to these weights. This approach has the nice
property that a small number of outliers will not overly influence the algorithm so long as 
they are not too far away, in which case perhaps they should be their own clusters anyway. 
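
The k-means++ seeding rule can be sketched as follows. This is a minimal illustration using Python's standard `random` module; the function name `kmeans_pp_seeds` and the toy data are assumptions, and the first center is drawn uniformly at random.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((u - v) ** 2 for u, v in zip(p, q))

def kmeans_pp_seeds(points, k, rng):
    """Pick k starting centers by distance-squared weighted sampling."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Weight of a point = squared distance to its nearest chosen center,
        # so already-chosen points have weight zero.
        weights = [min(sq_dist(a, c) for c in centers) for a in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

rng = random.Random(0)
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0), (10.1, 0.0)]
seeds = kmeans_pp_seeds(points, 3, rng)
print(seeds)
```

Replacing the weighted draw with a deterministic choice of the maximum-weight point turns this into farthest traversal, which makes the relationship between the two strategies explicit.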


Another approach is to run some other approximation algorithm for the k-means 
problem, and then use its output as the starting point for Lloyd’s algorithm. Note that 
applying Lloyd’s algorithm to the output of any other algorithm can only improve its 
score. An alternative SVD-based method for initialization is described and analyzed in 
Section 7.5. 


7.2.4 Ward's Algorithm


Another popular heuristic for k-means clustering is Ward's algorithm. Ward's algorithm begins with each data point in its own cluster, and then repeatedly merges pairs of
clusters until only k clusters remain. Specifically, Ward's algorithm merges the two clusters that minimize the immediate increase in k-means cost. That is, for a cluster C, define
$\text{cost}(C) = \sum_{a_i \in C} d^2(a_i, c)$, where c is the centroid of C. Then Ward’s algorithm merges
the pair (C, C′) minimizing cost(C ∪ C′) - cost(C) - cost(C′). Thus, Ward’s algorithm
can be viewed as a greedy k-means algorithm.
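
A direct, unoptimized sketch of Ward's algorithm follows. It recomputes cost(C ∪ C′) - cost(C) - cost(C′) from the definition for every pair at every step, which is simple but slow; the names `cost` and `ward` and the toy data are assumptions for illustration.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def cost(cluster):
    """k-means cost of one cluster: squared distances to its centroid."""
    c = centroid(cluster)
    return sum(sum((u - v) ** 2 for u, v in zip(a, c)) for a in cluster)

def ward(points, k):
    """Greedily merge clusters until only k remain; return the clusters."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Immediate increase in k-means cost if i and j are merged.
                inc = (cost(clusters[i] + clusters[j])
                       - cost(clusters[i]) - cost(clusters[j]))
                if best is None or inc < best[0]:
                    best = (inc, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)]
        clusters.append(merged)
    return clusters

points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
clusters = ward(points, 2)
print(clusters)
```

On this data the algorithm first merges {0, 1} and {5, 6}, then joins those two pairs, leaving the outlier at 20 as its own cluster.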


7.2.5 k-Means Clustering on the Line 


One case where the optimal k-means clustering can be found in polynomial time is 
when points lie in $\mathbf{R}^1$, i.e., on the line. This can be done using dynamic programming, as
follows. 


First, assume without loss of generality that the data points $a_1, \ldots, a_n$ have been
sorted, so $a_1 \le a_2 \le \ldots \le a_n$. Now, suppose that for some $i \ge 1$ we have already
computed the optimal k′-means clustering for points $a_1, \ldots, a_i$ for all $k' \le k$; note that
this is trivial to do for the base case of i = 1. Our goal is to extend this solution to points
$a_1, \ldots, a_{i+1}$. To do so, observe that each cluster will contain a consecutive sequence of
data points. So, given k′, for each $j \le i + 1$, compute the cost of using a single center
for points $a_j, \ldots, a_{i+1}$, which is the sum of squared distances of each of these points to their mean
value. Then add to that the cost of the optimal (k′ - 1)-clustering of points $a_1, \ldots, a_{j-1}$
which we computed earlier. Store the minimum of these sums, over choices of j, as our
optimal k′-means clustering of points $a_1, \ldots, a_{i+1}$. This has running time of O(kn) for a
given value of i. So overall our running time is $O(kn^2)$.
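
The dynamic program can be sketched as below. The prefix sums, which let the cost of a single cluster be computed in constant time, are an implementation detail assumed here to reach the stated running time; the function returns only the optimal cost, and it assumes k is at most the number of points.

```python
def kmeans_line(values, k):
    """Optimal k-means cost for points on the line, by dynamic programming."""
    a = sorted(values)
    n = len(a)
    # Prefix sums of values and of squares.
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for t, v in enumerate(a):
        s1[t + 1] = s1[t] + v
        s2[t + 1] = s2[t] + v * v

    def seg_cost(j, i):
        """Sum of squared distances of a[j:i] to their mean (0 <= j < i <= n)."""
        s = s1[i] - s1[j]
        return (s2[i] - s2[j]) - s * s / (i - j)

    inf = float("inf")
    # best[c][i] = optimal cost of clustering the first i points into c clusters.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for c in range(1, k + 1):
        for i in range(1, n + 1):
            for j in range(c - 1, i):  # the last cluster is a[j:i]
                cand = best[c - 1][j] + seg_cost(j, i)
                if cand < best[c][i]:
                    best[c][i] = cand
    return best[k][n]

cost_two = kmeans_line([8, 2, 7, 3], 2)
cost_one = kmeans_line([8, 2, 7, 3], 1)
print(cost_two, cost_one)
```

For the points {2, 3, 7, 8} the optimal 2-clustering is {2, 3}, {7, 8} with cost 1, while a single cluster centered at the mean 5 costs 26.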





7.3 k-Center Clustering 


In this section, instead of using the k-means clustering criterion, we use the k-center 
criterion. Recall that the k-center criterion partitions the points into k clusters so as to 
minimize the maximum distance of any point to its cluster center. Call the maximum dis- 
tance of any point to its cluster center the radius of the clustering. There is a k-clustering 
of radius r if and only if there are k spheres, each of radius r, which together cover all 
the points. Below, we give a simple algorithm to find k spheres covering a set of points. 
The following theorem shows that this algorithm only needs to use a radius that is at most
twice that of the optimal k-center solution. Note that this algorithm is equivalent to the 
farthest traversal strategy for initializing Lloyd’s algorithm. 


The Farthest Traversal k-clustering Algorithm 


Pick any data point to be the first cluster center. At time t, for t = 2, 3, ..., k,
pick the data point that is farthest from the existing cluster centers, that is, the point whose
distance to its nearest existing center is largest; make it the t-th cluster center.
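
A short sketch of farthest traversal follows. It maintains, for every point, the distance to its nearest chosen center, so each new center costs one pass over the data; the function name and the toy data are assumptions for illustration.

```python
import math

def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(p, q)))

def farthest_traversal(points, k, first=0):
    """Return (indices of the k chosen centers, radius of the clustering)."""
    centers = [first]
    # d[i] = distance from point i to its nearest chosen center so far.
    d = [dist(p, points[first]) for p in points]
    while len(centers) < k:
        nxt = max(range(len(points)), key=lambda i: d[i])
        centers.append(nxt)
        d = [min(d[i], dist(points[i], points[nxt])) for i in range(len(points))]
    return centers, max(d)

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0), (5.0, 8.0)]
centers, radius = farthest_traversal(points, 3)
print(centers, radius)
```

Here the traversal picks the points with indices 0, 3, and 4 and achieves radius 1. Placing centers at (0.5, 0), (10.5, 0), and (5, 8) would give radius 0.5, so the result is within the factor of two discussed in this section.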

Theorem 7.3 If there is a k-clustering of radius $\frac{r}{2}$, then the above algorithm finds a
k-clustering with radius at most r.


Proof: Suppose for contradiction that there is some data point x that is distance greater 
than r from all centers chosen. This means that each new center chosen was distance 
greater than r from all previous centers, because we could always have chosen x. This 
implies that we have k+1 data points, namely the centers chosen plus x, that are pairwise 
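more than distance r apart. Clearly, no two such points can belong to the same cluster
in any k-clustering of radius $\frac{r}{2}$, since by the triangle inequality two points in the same
cluster are at most distance r apart. So such a clustering would need at least k + 1 clusters,
contradicting the hypothesis. ∎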


7.4 Finding Low-Error Clusterings 


In the previous sections we saw algorithms for finding a local optimum to the k-means 
clustering objective, for finding a global optimum to the k-means objective on the line, and 
for finding a factor 2 approximation to the k-center objective. But what about finding 
a clustering that is close to the correct answer, such as the true clustering of proteins 
by function or a correct clustering of news articles by topic? For this we need some 
assumption about the data and what the correct answer looks like. The next few sections 
consider algorithms based on different such assumptions. 


7.5 Spectral Clustering 


Let A be an n × d data matrix with each row a data point and suppose we want to partition
the data points into k clusters. Spectral Clustering refers to a class of clustering algorithms 
which share the following outline: 


• Find the space V spanned by the top k (right) singular vectors of A.
• Project data points into V.
• Cluster the projected points.


7.5.1 Why Project? 


The reader may want to read Section 3.9.3, which shows the efficacy of spectral clustering 
for data stochastically generated from a mixture of spherical Gaussians. Here, we look at 
general data which may not have a stochastic generation model. 


We will later describe the last step in more detail. First, let us understand the central
advantage of doing the projection to V. It is simply that for any reasonable (unknown)
clustering of data points, the projection brings data points closer to their cluster centers.
This statement sounds mysterious and likely false, since the assertion is for ANY reasonable unknown clustering. We quantify it in Theorem 7.4. First some notation: We
represent a k-clustering by an n × d matrix C (same dimensions as A), where row i of C is
the center of the cluster to which data point i belongs. So, there are only k distinct rows
of C and each other row is a copy of one of these rows. The k-means objective function,
namely, the sum of squares of the distances of data points to their cluster centers, is

$$\sum_{i=1}^{n} |a_i - c_i|^2 = \|A - C\|_F^2.$$

Figure 7.3: Clusters in the full space and their projections

The projection reduces the sum of distance squares to cluster centers from $\|A - C\|_F^2$
to at most $8k\|A - C\|_2^2$ in the projection. Recall that $\|A - C\|_2$ is the spectral norm,
which is the top singular value of A - C. Now, $\|A - C\|_F^2 = \sum_t \sigma_t^2(A - C)$ and often
$\|A - C\|_F \gg \sqrt{k}\,\|A - C\|_2$, and so the projection substantially reduces the sum of
squared distances to cluster centers.


We will see later that in many clustering problems, including models like mixtures of 
Gaussians and Stochastic Block Models of communities, there is a desired clustering C 
where the regions overlap in the whole space, but are separated in the projection. Figure 
7.3 is a schematic illustration. Now we state the theorem and give its surprisingly simple 
proof. 


Theorem 7.4 Let A be an n × d matrix with $A_k$ the projection of the rows of A to the
subspace of the first k right singular vectors of A. For any matrix C of rank less than or
equal to k

$$\|A_k - C\|_F^2 \le 8k\,\|A - C\|_2^2.$$

$A_k$ is a matrix that is close to every C, in the sense $\|A_k - C\|_F^2 \le 8k\,\|A - C\|_2^2$. While
this seems contradictory, another way to state this is that for C far away from $A_k$ in
Frobenius norm, $\|A - C\|_2$ will also be high.


Proof: Since the rank of $(A_k - C)$ is less than or equal to 2k,
$\|A_k - C\|_F^2 \le 2k\,\|A_k - C\|_2^2$ and

$$\|A_k - C\|_2 \le \|A_k - A\|_2 + \|A - C\|_2 \le 2\,\|A - C\|_2.$$

The last inequality follows since $A_k$ is the best rank k approximation in spectral norm
(Theorem 3.9) and C has rank at most k. The theorem follows. ∎
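
The inequality of Theorem 7.4 can be checked numerically on synthetic data. The sketch below assumes NumPy is available and uses randomly generated centers and noise (all parameters are made up for illustration); it compares the squared Frobenius distance after projection with the bound and with the unprojected k-means cost.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 60, 20, 3

# C has at most k distinct rows, so its rank is at most k.
true_centers = rng.normal(scale=5.0, size=(k, d))
labels = rng.integers(0, k, size=n)
C = true_centers[labels]
A = C + rng.normal(size=(n, d))          # data = centers plus noise

# A_k: project the rows of A onto the span of the top k right singular vectors.
_, _, Vt = np.linalg.svd(A, full_matrices=False)
V_k = Vt[:k].T                           # d x k, orthonormal columns
A_k = A @ V_k @ V_k.T

lhs = np.linalg.norm(A_k - C, "fro") ** 2      # cost after projection
rhs = 8 * k * np.linalg.norm(A - C, 2) ** 2    # bound from Theorem 7.4
full = np.linalg.norm(A - C, "fro") ** 2       # k-means cost in the full space
print(lhs, rhs, full)
```

The theorem guarantees only that the first number is at most the second; comparing it with the third shows how much of the cost the projection removes on a particular data set.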


Suppose now that in the clustering C we would like to find, the cluster centers are pairwise
at least $\Omega(\sqrt{k}\,\|A - C\|_2)$ apart. This holds for many clustering problems including data
generated by stochastic models. Then, it will be easy to see that in the projection, 
most data points are a constant factor farther from centers of other clusters than their 
own cluster center and this makes it very easy for the following algorithm to find the 
clustering C modulo a small fraction of errors. 


7.5.2 The Algorithm 


Denote $\|A - C\|_2/\sqrt{n}$ by $\sigma(C)$. In the next section, we give an interpretation of $\|A - C\|_2$
indicating that $\sigma(C)$ is akin to the standard deviation of clustering C and hence the
notation $\sigma(C)$. We assume for now that $\sigma(C)$ is known to us for the desired clustering C.
This assumption can be removed by essentially doing a binary search.

Spectral Clustering - The Algorithm 


1. Find the top k right singular vectors of data matrix A and project rows of A to the
space spanned by them to get $A_k$ (cf. Section 3.5).


2. Select a random row from $A_k$ and form a cluster with all rows of $A_k$ at distance less
than $6k\sigma(C)/\epsilon$ from it.


3. Repeat Step 2 k times.
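
The three steps can be sketched with NumPy as follows. Two assumptions are made here and are not spelled out in the algorithm statement: the threshold $6k\sigma(C)/\epsilon$ is passed in as a known number `radius`, and each repetition of Step 2 draws its random row from the rows not yet placed in a cluster. The function name `spectral_cluster` and the synthetic data are illustrative.

```python
import numpy as np

def spectral_cluster(A, k, radius, rng):
    """Project to the top-k singular subspace, then grow k clusters.

    Returns an array of labels in 0..k-1, with -1 for rows never assigned.
    """
    # Step 1: project the rows of A to the span of the top k right singular vectors.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    V_k = Vt[:k].T
    A_k = A @ V_k @ V_k.T
    labels = np.full(A.shape[0], -1)
    for c in range(k):                       # Step 3: repeat Step 2 k times
        remaining = np.flatnonzero(labels == -1)
        if remaining.size == 0:
            break
        pick = rng.choice(remaining)         # Step 2: a random unassigned row
        near = np.linalg.norm(A_k - A_k[pick], axis=1) < radius
        labels[near & (labels == -1)] = c
    return labels

# Synthetic data: three well-separated centers in R^5 with unit noise.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0, 0.0, 0.0, 0.0],
                    [50.0, 0.0, 0.0, 0.0, 0.0],
                    [0.0, 50.0, 0.0, 0.0, 0.0]])
truth = np.repeat(np.arange(3), 20)
A = centers[truth] + rng.normal(size=(60, 5))
labels = spectral_cluster(A, 3, radius=15.0, rng=rng)
print(labels)
```

With centers this far apart relative to the noise, every random pick recovers its whole cluster, so the labels agree with the true clustering up to renaming.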


Theorem 7.5 If in a k-clustering C, every pair of centers is separated by at least $15k\sigma(C)/\epsilon$
and every cluster has at least $\epsilon n$ points in it, then with probability at least $1 - \epsilon$, Spectral
Clustering finds a clustering C′ that differs from C on at most $\epsilon^2 n$ points.


Proof: Let $v_i$ denote row i of $A_k$. We first show that for most data points, the projection
of the data point is within distance $3k\sigma(C)/\epsilon$ of its cluster center. I.e., we show that |M| is
small, where

$$M = \{i : |v_i - c_i| \ge 3k\sigma(C)/\epsilon\}.$$

Now, $\|A_k - C\|_F^2 = \sum_i |v_i - c_i|^2 \ge \sum_{i \in M} |v_i - c_i|^2 \ge |M| \frac{9k^2\sigma^2(C)}{\epsilon^2}$. So, using Theorem 7.4,
we get:

$$|M| \frac{9k^2\sigma^2(C)}{\epsilon^2} \le \|A_k - C\|_F^2 \le 8kn\sigma^2(C) \implies |M| \le \frac{8\epsilon^2 n}{9k}. \qquad (7.1)$$

Call a data point i “good” if $i \notin M$. For any two good data points i and j belonging to the same cluster, since their projections are within $3k\sigma(C)/\epsilon$ of the center of the
cluster, projections of the two data points are within $6k\sigma(C)/\epsilon$ of each other. On the
other hand, if two good data points i and l are in different clusters, since the centers
of the two clusters are at least $15k\sigma(C)/\epsilon$ apart, their projections must be greater than
$15k\sigma(C)/\epsilon - 6k\sigma(C)/\epsilon = 9k\sigma(C)/\epsilon$ apart. So, if we picked a good data point (say point
i) in Step 2, the set of good points we put in its cluster is exactly the set of good points
in the same cluster as i. Thus, if in each of the k executions of Step 2, we picked a good
point, all good points are correctly clustered and since $|M| \le \epsilon^2 n$, the theorem would hold.

To complete the proof, we must argue that the probability of any pick in Step 2 being
bad is small. The probability that the first pick in Step 2 is bad is at most $|M|/n \le \epsilon^2/k$.
For each subsequent execution of Step 2, all the good points in at least one cluster are
remaining candidates. So there are at least $(\epsilon - \epsilon^2)n$ good points left and so the probability
that we pick a bad point is at most $|M|/((\epsilon - \epsilon^2)n)$, which is at most $\epsilon/k$. The union bound
over the k executions yields the desired result. ∎


7.5.3 Means Separated by Ω(1) Standard Deviations


For probability distributions on the real line, the mnemonic “means separated by six
standard deviations” suffices to distinguish different distributions. Spectral Clustering
enables us to do the same thing in higher dimensions provided $k \in O(1)$ and six is
replaced by some constant. First we define standard deviation for general, not necessarily
stochastically generated, data: it is just the maximum over all unit vectors v of the square
root of the mean squared distance of data points from their cluster centers in the direction
v. Namely, the standard deviation $\sigma(C)$ of clustering C is defined by:

$$\sigma(C)^2 = \frac{1}{n} \max_{v : |v| = 1} \sum_{i=1}^{n} [(a_i - c_i) \cdot v]^2 = \frac{1}{n} \max_{v : |v| = 1} |(A - C)v|^2 = \frac{1}{n} \|A - C\|_2^2.$$
This coincides with the definition of $\sigma(C)$ we made earlier. Assuming $k \in O(1)$, it is easy
to see that Theorem 7.5 can be restated as


If cluster centers in C are separated by $\Omega(\sigma(C))$, then the spectral clustering
algorithm finds C′ which differs from C only in a small fraction of data points.


It can be seen that the “means separated by Ω(1) standard deviations” condition holds
for many stochastic models. We illustrate with two examples here. First, suppose we have
a mixture of $k \in O(1)$ spherical Gaussians, each with standard deviation one. The data
is generated according to this mixture. If the means of the Gaussians are Ω(1) apart,
then the condition (means separated by Ω(1) standard deviations) is satisfied and so if
we project to the SVD subspace and cluster, we will get (nearly) the correct clustering.
This was already discussed in detail in Chapter ??.


We discuss a second example. Stochastic Block Models are models of communities.
Suppose there are k communities $C_1, C_2, \ldots, C_k$ among a population of n people. Suppose the probability of two people in the same community knowing each other is p and
if they are in different communities, the probability is q, where q < p.33 We assume the
events that person i knows person j are independent across all i and j.


Specifically, we are given an n × n data matrix A, where $a_{ij} = 1$ if and only if i and
j know each other. We assume the $a_{ij}$ are independent random variables, and use $a_i$ to
denote the i-th row of A. It is useful to think of A as the adjacency matrix of a graph, such
as the friendship network in Facebook. We will also think of the rows $a_i$ as data points.
The clustering problem is to classify the data points into the communities they belong to.
In practice, the graph is fairly sparse, i.e., p and q are small, namely, O(1/n) or O(ln n/n).


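Consider the simple case of two communities with n/2 people in each and with

$$p = \frac{\alpha}{n}, \quad q = \frac{\beta}{n}, \quad \text{where } \alpha, \beta \in O(\ln n).$$

Let u and v be the centroids of the data points in community one and community two
respectively; so $u_i \approx p$ for $i \in C_1$ and $u_j \approx q$ for $j \in C_2$, and $v_i \approx q$ for $i \in C_1$ and $v_j \approx p$
for $j \in C_2$. We have

$$|u - v|^2 \approx \sum_{j=1}^{n} (u_j - v_j)^2 = n \left( \frac{\alpha}{n} - \frac{\beta}{n} \right)^2 = \frac{(\alpha - \beta)^2}{n},$$

$$\text{Inter-centroid distance} \approx \frac{\alpha - \beta}{\sqrt{n}}. \qquad (7.2)$$
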
We need to upper bound $\|A - C\|_2$. This is non-trivial since we have to prove a
uniform upper bound on $|(A - C)v|$ for all unit vectors v. Fortunately, the subject of
Random Matrix Theory (RMT) already does this for us. RMT tells us that

$$\|A - C\|_2 \le O^*(\sqrt{np}) = O^*(\sqrt{\alpha}),$$

where the $O^*$ hides logarithmic factors. So as long as $\alpha - \beta \in \Omega^*(\sqrt{\alpha})$, we have the
means separated by Ω(1) standard deviations and spectral clustering works.








33 More generally, for each pair of communities a and b, there could be a probability $p_{ab}$ that a person
from community a knows a person from community b. But for the discussion here, we take $p_{aa} = p$ for
all a and $p_{ab} = q$ for all $a \ne b$.

One important observation is that in these examples as well as many others, the k-
means objective function in the whole space is too high and so the projection is essential 
before we can cluster. 


7.5.4 Laplacians 


An important special case of spectral clustering is when k = 2 and a spectral algorithm 
is applied to the Laplacian matrix L of a graph, which is defined as 


L=D-A 


where A is the adjacency matrix and D is a diagonal matrix of degrees. Since A has a
negative sign, we look at the lowest two singular values and corresponding vectors rather
than the highest.


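L is a symmetric matrix and is easily seen to be positive semi-definite: for any vector
x, we have

$$x^T L x = \sum_i d_{ii} x_i^2 - \sum_{(i,j) \in E} x_i x_j = \frac{1}{2} \sum_{(i,j) \in E} (x_i - x_j)^2,$$

where the sums over E count each edge in both directions, as (i, j) and as (j, i).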


Also, since all row sums of L are zero (and L is symmetric), its lowest eigenvalue is 0 with
the eigenvector 1 of all 1's. This is also the lowest singular vector of L. The projection
of all data points (rows) to this vector is just the origin and so gives no information. If
we take the second lowest singular vector and project to it, which is essentially projecting
to the space of the bottom two singular vectors, we get the very simple problem of n real
numbers which we need to cluster into two clusters.
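The two facts used above (zero row sums and the non-negative quadratic form) can be checked numerically. The following is a minimal sketch on a small example graph chosen for illustration (two triangles joined by one edge); the function names are hypothetical and not part of the text.

```python
def laplacian(n, edges):
    """Return L = D - A as a list of rows for an undirected graph on 0..n-1."""
    L = [[0] * n for _ in range(n)]
    for i, j in edges:
        L[i][i] += 1      # degree contribution to D
        L[j][j] += 1
        L[i][j] -= 1      # adjacency contribution to -A
        L[j][i] -= 1
    return L

def quadratic_form(L, x):
    """Compute x^T L x directly from the matrix entries."""
    n = len(x)
    return sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))

# Illustrative graph: triangles {0,1,2} and {3,4,5} joined by the edge (2,3).
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 5), (5, 3)]
L = laplacian(6, edges)

row_sums = [sum(row) for row in L]

# A vector that is +1 on one triangle and -1 on the other.
x = [1, 1, 1, -1, -1, -1]
qf = quadratic_form(L, x)

# Sum over undirected edges of (x_i - x_j)^2; each edge listed once here.
edge_sum = sum((x[i] - x[j]) ** 2 for i, j in edges)
```

Here `qf` and `edge_sum` agree, since summing over each undirected edge once equals half the sum over both orientations. Only the bridging edge contributes, which is why a vector of this shape is a natural candidate for separating the two triangles.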


7.5.5 Local spectral clustering 


So far our emphasis has been on partitioning the data into disjoint clusters. However,
the structure of many data sets consists of overlapping communities. In this case, using
k-means with spectral clustering, the overlap of two communities shows up as a community.
This is illustrated in Figure 7.4.


An alternative to using k-means with spectral clustering is to find the minimum 1-norm
vector in the space spanned by the top singular vectors. Let A be the matrix whose
columns are the singular vectors. To find a vector y in the space spanned by the columns
of A, solve the linear system Ax = y. This is a slightly different looking problem than
Ax = c where c is a constant vector. To convert Ax = y to the more usual form, write it
as

\[ [A, -I] \begin{pmatrix} x \\ y \end{pmatrix} = 0. \]

However, if we want to minimize $\|y\|_1$, the solution is x = y = 0.
Thus we add the row 1, 1, ..., 1, 0, 0, ..., 0 to [A, -I] and a 1 to the top of the right-hand
side vector to force the coordinates of x to add up to one. Minimizing $\|y\|_1$ does not appear
to be a linear program, but we can write $y = y_a - y_b$ and require $y_a \ge 0$ and $y_b \ge 0$. Now finding
the minimum one-norm vector in the span of the columns of A is the linear program

\[ \min \sum_i \left( y_{ai} + y_{bi} \right) \quad \text{subject to} \quad (A, -I, I) \begin{pmatrix} x \\ y_a \\ y_b \end{pmatrix} = 0, \quad y_a \ge 0, \quad y_b \ge 0, \]

together with the added row that forces the coordinates of x to sum to one.


Local communities 

In large social networks with a billion vertices, global clustering is likely to result in
communities of several hundred million vertices. What you may actually want is a local
community containing several individuals with only 50 vertices. To do this, if one starts a
random walk at a vertex v and computes the frequency of visiting vertices, it will converge
to the first singular vector. However, the distribution after a small number of steps will
be primarily in the small community containing v and will be proportional to the first
singular vector distribution restricted to the vertices in the small community containing v,
only it will be higher by some constant value. If one wants to determine the local communities
containing vertices $v_1$, $v_2$, and $v_3$, start with three probability distributions, one with
probability one at $v_1$, one with probability one at $v_2$, and one with probability one at $v_3$,
and find early approximations to the first three singular vectors. Then find the minimum
1-norm vector in the space spanned by the early approximations.
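The claim that a short random walk stays mostly inside the starting community can be illustrated on a toy graph. This is a minimal sketch under assumptions not in the text: the graph (two 5-vertex cliques joined by a single edge), the number of steps (three), and all function names are chosen only for illustration.

```python
from fractions import Fraction

def two_cliques_with_bridge(size):
    """Two cliques on `size` vertices each, joined by one bridge edge."""
    adj = {v: set() for v in range(2 * size)}
    for block in (range(size), range(size, 2 * size)):
        for u in block:
            for v in block:
                if u != v:
                    adj[u].add(v)
    adj[size - 1].add(size)
    adj[size].add(size - 1)
    return adj

def walk_step(dist, adj):
    """One step of the simple random walk applied to a probability distribution."""
    new = {v: Fraction(0) for v in adj}
    for v, mass in dist.items():
        if mass:
            share = mass / len(adj[v])   # split mass evenly among neighbours
            for u in adj[v]:
                new[u] += share
    return new

adj = two_cliques_with_bridge(5)

# Start with probability one at vertex 0, inside the first clique.
dist = {v: Fraction(0) for v in adj}
dist[0] = Fraction(1)
for _ in range(3):
    dist = walk_step(dist, adj)

# Probability mass still inside the starting clique after three steps.
mass_in_community = sum(dist[v] for v in range(5))

# Long-run share of the first clique (stationary mass is proportional to degree).
total_degree = sum(len(adj[v]) for v in adj)
stationary_share = Fraction(sum(len(adj[v]) for v in range(5)), total_degree)
```

After three steps more than nine tenths of the mass is still in the starting clique, whereas in the limit the two cliques share the mass equally. Exact fractions are used so the comparison is free of rounding error.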


Hidden structure 

In the previous section we discussed overlapping communities. Another issue is hidden 
structure. Suppose the vertices of a social network could be partitioned into a number of 
strongly connected communities. By strongly connected we mean the probability of an 
edge between two vertices in a community is much higher than the probability of an edge 
between two vertices in different communities. Suppose the vertices of the graph could 
be partitioned in another way which was incoherent^{34} with the first partitioning and the
probability of an edge between two vertices in one of these communities is higher than an 
edge between two vertices in different communities. If the probability of an edge between 
two vertices in a community of this second partitioning is less than that in the first, then 
a clustering algorithm is likely to find the first partitioning rather than the second. How- 
ever, the second partitioning, which we refer to as hidden structure, may be the structure 
that we want to find. The way to do this is to use your favorite clustering algorithm to 
produce the dominant structure and then stochastically weaken the dominant structure 
by removing some community edges in the graph. Now if you apply the clustering algo- 
rithm to the modified graph, it can find the hidden community structure. Having done 
this, go back to the original graph, weaken the hidden structure and reapply the clustering
algorithm. It will now find a better approximation to the dominant structure. This
technique can be used to find a number of hidden levels in several types of social networks.
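The weakening step can be sketched as follows. This is only one plausible reading of "stochastically weaken": each edge inside a found community is dropped independently with a fixed probability. The function name, the `remove_prob` parameter, and the example graph are assumptions made for illustration.

```python
import random

def weaken_structure(edges, label, remove_prob, rng):
    """Drop each intra-community edge independently with probability remove_prob.

    edges: list of (u, v) pairs; label[v] is the community found for vertex v.
    Edges between different communities are always kept.
    """
    kept = []
    for u, v in edges:
        same_community = label[u] == label[v]
        if same_community and rng.random() < remove_prob:
            continue
        kept.append((u, v))
    return kept

# Illustrative graph: two triangles joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
label = [0, 0, 0, 1, 1, 1]          # dominant structure found by some clustering
rng = random.Random(0)

only_cross = weaken_structure(edges, label, 1.0, rng)   # remove all community edges
unchanged = weaken_structure(edges, label, 0.0, rng)    # remove none
```

In practice one would pick an intermediate `remove_prob` and rerun the clustering algorithm on the returned edge list; the two extreme values above are used only because their outcome does not depend on the random draws.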





34 incoherent, give definition

Figure 7.4: (a) illustrates the adjacency matrix of a graph with a six vertex clique that
overlaps a five vertex clique in two vertices. (b) illustrates the matrix whose columns
consist of the top two singular vectors, and (c) illustrates the mapping of rows in the
singular vector matrix to three points in two dimensional space. Instead of the two cliques,
we get the non-overlapping portion of each of the two cliques plus their intersection as
communities.


Block model 

One technique for generating graphs with communities is to use the block model where
the vertices are partitioned into blocks and each block is a G(n, p) graph generated with
some edge probability. The edges in off-diagonal blocks are generated with a lower
probability. One can also generate graphs with hidden structure. For example, the vertices in
an n vertex graph might be partitioned into two communities, the first community having
vertices 1 to n/2 and the second community having vertices n/2 + 1 to n. The dominant
structure is generated with probability $p_1$ for edges within communities and probability $q_1$
for edges between communities. Then the vertices are randomly permuted and the hidden
structure is generated using the first n/2 vertices in the permuted order for one community
and the remaining vertices for the second community with probabilities $p_2$ and $q_2$,
which are lower than $p_1$ and $q_1$.
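A generator along these lines might look like the sketch below. The text does not say how the dominant and hidden edge sets are combined; taking their union is an assumption made here, as are the function names and the 0-based vertex numbering.

```python
import random

def block_model_edges(n, community, p, q, rng):
    """Sample edges: probability p inside a community, q between communities."""
    edges = set()
    for u in range(n):
        for v in range(u + 1, n):
            prob = p if community[u] == community[v] else q
            if rng.random() < prob:
                edges.add((u, v))
    return edges

def graph_with_hidden_structure(n, p1, q1, p2, q2, rng):
    """Overlay a dominant two-block structure with a hidden one (union of edges)."""
    # Dominant structure: first half of the vertices vs second half.
    dominant = [0 if v < n // 2 else 1 for v in range(n)]
    # Hidden structure: first half of a random permutation vs the rest.
    order = list(range(n))
    rng.shuffle(order)
    hidden = [0] * n
    for position, v in enumerate(order):
        hidden[v] = 0 if position < n // 2 else 1
    edges = block_model_edges(n, dominant, p1, q1, rng)
    edges |= block_model_edges(n, hidden, p2, q2, rng)
    return edges, dominant, hidden

# Degenerate parameters make the outcome deterministic: only dominant
# intra-community edges appear, and every one of them is present.
edges, dominant, hidden = graph_with_hidden_structure(
    8, 1.0, 0.0, 0.0, 0.0, random.Random(1))
```

With realistic parameters (for example $p_1 > p_2 > q_1, q_2$) the result is random, so the deterministic setting above is used only to make the behaviour easy to inspect.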


An interesting question is how to determine the quality of a community found. Many
researchers use an existing standard of what the communities are. However, if you want
to use clustering techniques to find communities, there probably is no external standard
or you would just use that instead of clustering. A way to determine if you have found a
real community structure is to ask if the graph is more likely generated by a model of the
structure found than by a completely random model. Suppose you found a partition of
two communities each with n/2 vertices. Using the number of edges in each community
and the number of inter-community edges, ask what is the probability of the graph being
generated by a block model where p and q are the probabilities determined by the edge
density within communities and the edge density between communities. One can compare
this probability with the probability that the graph was generated by a G(n, p) model with
probability (p + q)/2.
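The comparison can be carried out with log-probabilities to avoid underflow. The sketch below estimates p and q as edge densities, as the paragraph suggests, and evaluates both models on the observed graph; the function names and the small example are illustrative assumptions.

```python
import math

def log_likelihood(present, pairs, prob):
    """Log-probability of one specific pattern with `present` edges among `pairs` slots."""
    ll = 0.0
    if present:
        ll += present * math.log(prob)
    if pairs - present:
        ll += (pairs - present) * math.log(1 - prob)
    return ll

def compare_models(n, edges, community):
    """Return (block-model log-likelihood, single-probability log-likelihood)."""
    intra_pairs = inter_pairs = 0
    for u in range(n):
        for v in range(u + 1, n):
            if community[u] == community[v]:
                intra_pairs += 1
            else:
                inter_pairs += 1
    intra_edges = sum(1 for u, v in edges if community[u] == community[v])
    inter_edges = len(edges) - intra_edges
    p = intra_edges / intra_pairs          # edge density within communities
    q = inter_edges / inter_pairs          # edge density between communities
    block_ll = (log_likelihood(intra_edges, intra_pairs, p)
                + log_likelihood(inter_edges, inter_pairs, q))
    r = (p + q) / 2                        # single edge probability for the random model
    random_ll = log_likelihood(len(edges), intra_pairs + inter_pairs, r)
    return block_ll, random_ll

# Illustrative graph: two triangles joined by one edge.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
community = [0, 0, 0, 1, 1, 1]
block_ll, random_ll = compare_models(6, edges, community)
```

A large gap `block_ll - random_ll` suggests the partition explains the graph much better than a structureless model. How large a gap counts as convincing is not specified in the text and would need a separate criterion.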
7.6 Approximation Stability
7.6.1 The Conceptual Idea 


We now consider another condition that will allow us to produce accurate clusters 
from data. To think about this condition, imagine that we are given a few thousand news 
articles that we want to cluster by topic. These articles could be represented as points in 
a high-dimensional space (e.g., axes could correspond to different meaningful words, with 
coordinate i indicating the frequency of that word in a given article). Or, alternatively, it 
could be that we have developed some text-processing program that given two articles x 
and y computes some measure of distance d(x, y) between them. We assume there exists 
some correct clustering $C_T$ of our news articles into k topics; of course, we do not know
what $C_T$ is; that is what we want our algorithm to find.


If we are clustering with an algorithm that aims to minimize the k-means score of its
solution, then implicitly this means we believe that the clustering $C^{OPT}$ of minimum
k-means score is either equal to, or very similar to, the clustering $C_T$. Unfortunately, finding
the clustering of minimum k-means score is NP-hard. So, let us broaden our belief a bit
and assume that any clustering C whose k-means score is within 10% of the minimum
is also very similar to $C_T$. This should give us a little bit more slack. Unfortunately,
finding a clustering of score within 10% of the minimum is also an NP-hard problem.
Nonetheless, we will be able to use this assumption to efficiently find a clustering that is
close to $C_T$. The trick is that NP-hardness is a worst-case notion, whereas in contrast,
this assumption implies structure on our data. In particular, it implies that all clusterings
that have score within 10% of the minimum have to be similar to each other. We will
then be able to utilize this structure in a natural "ball-growing" clustering algorithm.


7.6.2 Making this Formal 


To make this discussion formal, we first specify what we mean when we say that two
different ways of clustering some data are "similar" to each other. Let $C = \{C_1, \ldots, C_k\}$
and $C' = \{C'_1, \ldots, C'_k\}$ be two different k-clusterings of some dataset A. For example, C
could be the clustering that our algorithm produces, and C' could be the clustering $C_T$.
Let us define the distance between these two clusterings to be the fraction of points that
would have to be re-clustered in C to make C match C', where by "match" we mean that
there should be a bijection between the clusters of C and the clusters of C'. We can write
this distance mathematically as:

\[ dist(C, C') = \min_{\sigma} \frac{1}{n} \sum_{i=1}^{k} |C_i \setminus C'_{\sigma(i)}|, \]

where n is the number of points in A and the minimum is over all permutations $\sigma$ of {1, ..., k}.
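For small k the distance between two clusterings can be computed directly from the definition by trying every matching of clusters. The sketch below is a brute-force illustration (the function name and example are hypothetical); it takes time proportional to k! and is meant only to make the definition concrete.

```python
from itertools import permutations

def clustering_distance(C, C_prime, n):
    """Fraction of the n points to re-cluster in C to match C_prime.

    C and C_prime are lists of k sets; all bijections between clusters are tried.
    """
    k = len(C)
    best = None
    for sigma in permutations(range(k)):
        # Points of C_i that are not in the cluster matched to it.
        moved = sum(len(C[i] - C_prime[sigma[i]]) for i in range(k))
        if best is None or moved < best:
            best = moved
    return best / n

# Six points; the two clusterings disagree only on point 5,
# and their clusters are listed in a different order.
C = [{0, 1, 2}, {3, 4, 5}]
C_prime = [{3, 4}, {0, 1, 2, 5}]
d = clustering_distance(C, C_prime, 6)
```

The minimum over permutations is what makes the measure independent of how the clusters happen to be numbered: under the identity matching five points would have to move, under the swapped matching only one.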


For c > 1 and $\varepsilon > 0$ we say that a data set satisfies $(c, \varepsilon)$-approximation-stability
with respect to a given objective (such as k-means or k-median) if every clustering C
whose cost is within a factor c of the minimum-cost clustering for that objective satisfies
$dist(C, C_T) < \varepsilon$. That is, it is sufficient to be within a factor c of optimal according to our
objective in order for the fraction of points clustered incorrectly to be less than $\varepsilon$. We will
specifically focus in this discussion on the k-median objective rather than the k-means
objective, since it is a bit easier to work with.


What we will now show is that under this condition, even though it may be NP-hard
in general to find a clustering that is within a factor c of optimal, we can nonetheless
efficiently find a clustering C' such that $dist(C', C_T) \le \varepsilon$, so long as all clusters in $C_T$ are
reasonably large. To simplify notation, let C* denote the clustering of minimum k-median
cost, and to keep the discussion simpler, let us also assume that $C_T = C^*$; that is, the
target clustering is also the clustering with the minimum k-median score.


7.6.3 Algorithm and Analysis 


Before presenting an algorithm, we begin with a helpful lemma that will guide our
design. For a given data point $a_i$, define its weight $w(a_i)$ to be its distance to the center
of its cluster in C*. Notice that the k-median cost of C* is $OPT = \sum_{i=1}^{n} w(a_i)$. Define
$w_{avg} = OPT/n$ to be the average weight of the points in A. Finally, define $w_2(a_i)$ to be
the distance of $a_i$ to its second-closest center in C*.


Lemma 7.6 Assume dataset A satisfies $(c, \varepsilon)$ approximation-stability with respect to the
k-median objective, each cluster in $C_T$ has size at least $2\varepsilon n$, and $C_T = C^*$. Then,


1. Fewer than $\varepsilon n$ points $a_i$ have $w_2(a_i) - w(a_i) \le (c - 1) w_{avg} / \varepsilon$.


2. At most $5\varepsilon n/(c - 1)$ points $a_i$ have $w(a_i) \ge (c - 1) w_{avg} / (5\varepsilon)$.


Proof: For part (1), suppose that $\varepsilon n$ points $a_i$ have $w_2(a_i) - w(a_i) \le (c - 1) w_{avg} / \varepsilon$.
Consider modifying $C_T$ to a new clustering C' by moving each of these points $a_i$ into
the cluster containing its second-closest center. By assumption, the k-median cost of the
clustering has increased by at most $\varepsilon n (c - 1) w_{avg} / \varepsilon = (c - 1) \cdot OPT$. This means that
the cost of the new clustering is at most $c \cdot OPT$. However, $dist(C', C_T) = \varepsilon$ because (a) we
moved $\varepsilon n$ points to different clusters, and (b) each cluster in $C_T$ has size at least $2\varepsilon n$ so the
optimal permutation $\sigma$ in the definition of dist remains the identity. So, this contradicts
approximation stability. Part (2) follows from the definition of "average"; if it did not
hold then $\sum_{i=1}^{n} w(a_i) > n w_{avg}$, a contradiction. ∎


A datapoint $a_i$ is bad if it satisfies either item (1) or (2) of Lemma 7.6 and good if it
satisfies neither one. So, there are at most $b = \varepsilon n + \frac{5 \varepsilon n}{c - 1}$ bad points and the rest are good.
Define "critical distance" $d_{crit} = \frac{(c - 1) w_{avg}}{5 \varepsilon}$. Lemma 7.6 implies that the good points have
distance at most $d_{crit}$ to the center of their own cluster in C* and distance at least $5 d_{crit}$
to the center of any other cluster in C*.

This suggests the following algorithm. Suppose we create a graph G with the points
$a_i$ as vertices, and edges between any two points $a_i$ and $a_j$ with $d(a_i, a_j) < 2 d_{crit}$. Notice
that by triangle inequality, the good points within the same cluster in C* have distance
less than $2 d_{crit}$ from each other so they will be fully connected and form a clique. Also,
again by triangle inequality, any edge that goes between different clusters must be
between two bad points. In particular, if $a_i$ is a good point in one cluster, and it has an edge
to some other point $a_j$, then $a_j$ must have distance less than $3 d_{crit}$ to the center of $a_i$'s
cluster. This means that if $a_j$ had a different closest center, which obviously would also
be at distance less than $3 d_{crit}$, then $a_i$ would have distance less than $2 d_{crit} + 3 d_{crit} = 5 d_{crit}$
to that center, violating its goodness. So, bridges in G between different clusters can only
occur between bad points.


Assume now that each cluster in $C_T$ has size at least 2b + 1; this is the sense in which we
are requiring that $\varepsilon n$ be small compared to the smallest cluster in $C_T$. In this case, create
a new graph H by connecting any two points $a_i$ and $a_j$ that share at least b + 1 neighbors
in common in G, themselves included. Since every cluster has at least 2b + 1 - b = b + 1
good points, and these points are fully connected in G, this means that H will contain an
edge between every pair of good points in the same cluster. On the other hand, since the
only edges in G between different clusters are between bad points, and there are at most
b bad points, this means that H will not have any edges between different clusters in $C_T$.
Thus, if we take the k largest connected components in H, these will all correspond to
subsets of different clusters in $C_T$, with at most b points remaining.


At this point we have a correct clustering of all but at most b points in A. Call these
clusters $C_1, \ldots, C_k$, where $C_j \subseteq C^*_j$. To cluster the remaining points $a_i$, we assign them
to the cluster $C_j$ that minimizes the median distance between $a_i$ and points in $C_j$. Since
each $C_j$ has more good points than bad points, and each good point in $C_j$ has distance at
most $d_{crit}$ to center $c^*_j$, by triangle inequality the median of these distances must lie in the
range $[d(a_i, c^*_j) - d_{crit}, d(a_i, c^*_j) + d_{crit}]$. This means that this second step will correctly
cluster all points $a_i$ for which $w_2(a_i) - w(a_i) > 2 d_{crit}$. In particular, we correctly cluster
all points except possibly for some of the at most $\varepsilon n$ satisfying item (1) of Lemma 7.6.


The above discussion assumes the value $d_{crit}$ is known to our algorithm; we leave it as
an exercise to the reader to modify the algorithm to remove this assumption. Summarizing,
we have the following algorithm and theorem.


Algorithm k-Median Stability (given c, $\varepsilon$, $d_{crit}$)


1. Create a graph G with a vertex for each datapoint in A, and an edge between
vertices i and j if $d(a_i, a_j) \le 2 d_{crit}$.


2. Create a graph H with a vertex for each vertex in G and an edge between vertices i
and j if i and j share at least b + 1 neighbors in common, themselves included, for
$b = \varepsilon n + \frac{5 \varepsilon n}{c - 1}$. Let $C_1, \ldots, C_k$ denote the k largest connected components in H.


3. Assign each point not in $C_1 \cup \ldots \cup C_k$ to the cluster $C_j$ of smallest median distance.


Theorem 7.7 Assume A satisfies $(c, \varepsilon)$ approximation-stability with respect to the
k-median objective, that each cluster in $C_T$ has size at least $\frac{10 \varepsilon}{c - 1} n + 2 \varepsilon n + 1$, and that $C_T = C^*$.
Then Algorithm k-Median Stability will find a clustering C such that $dist(C, C_T) \le \varepsilon$.
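The three steps of Algorithm k-Median Stability can be sketched directly. This is a minimal illustration, not an optimized implementation: it assumes the bad-point bound b and the critical distance are supplied by the caller, uses a quadratic number of distance evaluations, and all names are hypothetical.

```python
from statistics import median

def k_median_stability(points, k, b, d_crit, dist):
    """Sketch of the three-step algorithm; returns k lists of point indices."""
    n = len(points)
    # Step 1: neighbourhoods in the threshold graph G (each point is its own neighbour).
    nbrs = [{j for j in range(n) if dist(points[i], points[j]) <= 2 * d_crit}
            for i in range(n)]
    # Step 2: graph H joins i and j when they share at least b + 1 neighbours in G.
    adj_h = [[j for j in range(n) if j != i and len(nbrs[i] & nbrs[j]) >= b + 1]
             for i in range(n)]
    seen = [False] * n
    components = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack, comp = [start], []
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in adj_h[u]:
                if not seen[v]:
                    seen[v] = True
                    stack.append(v)
        components.append(comp)
    components.sort(key=len, reverse=True)
    cores = components[:k]                      # the k largest components of H
    clusters = [list(core) for core in cores]
    assigned = {i for core in cores for i in core}
    # Step 3: attach every leftover point to the core of smallest median distance.
    for i in range(n):
        if i in assigned:
            continue
        best = min(range(len(cores)),
                   key=lambda j: median(dist(points[i], points[m]) for m in cores[j]))
        clusters[best].append(i)
    return clusters

# Illustrative 1-D data: two tight groups plus one stray point at 3.0.
points = [0.0, 0.2, 0.4, 0.6, 10.0, 10.2, 10.4, 10.6, 3.0]
result = k_median_stability(points, k=2, b=1, d_crit=0.5,
                            dist=lambda x, y: abs(x - y))
```

In this example the stray point is isolated in both graphs, so it is left out of the two cores and then attached in step 3 to the group around 0, whose median distance to it is smaller.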





7.7 High-Density Clusters 


We now turn from the assumption that clusters are center-based to the assumption 
that clusters consist of high-density regions, separated by low-density moats such as in 
Figure 7.1. 


7.7.1 Single Linkage 


One natural algorithm for clustering under the high-density assumption is called single
linkage. This algorithm begins with each point in its own cluster and then repeatedly
merges the two "closest" clusters into one, where the distance between two clusters is
defined as the minimum distance between points in each cluster. That is, $d_{min}(C, C') =
\min_{x \in C, y \in C'} d(x, y)$, and the algorithm merges the two clusters C and C' whose $d_{min}$ value
is smallest over all pairs of clusters, breaking ties arbitrarily. It then continues until there
are only k clusters. This is called an agglomerative clustering algorithm because it begins
with many clusters and then starts merging, or agglomerating, them together. Single-linkage
is equivalent to running Kruskal's minimum-spanning-tree algorithm, but halting
when there are k trees remaining. The following theorem is fairly immediate.


Theorem 7.8 Suppose the desired clustering $C^*_1, \ldots, C^*_k$ satisfies the property that there
exists some distance $\sigma$ such that


1. any two data points in different clusters have distance at least $\sigma$, and


2. for any cluster $C^*_i$ and any partition of $C^*_i$ into two non-empty sets A and $C^*_i \setminus A$,
there exist points on each side of the partition of distance less than $\sigma$.


Then, single-linkage will correctly recover the clustering $C^*_1, \ldots, C^*_k$.


Proof: Consider running the algorithm until all pairs of clusters C and C' have $d_{min}(C, C') \ge \sigma$.
At that point, by (2), each target cluster $C^*_i$ will be fully contained within some cluster
of the single-linkage algorithm. On the other hand, by (1) and by induction, each cluster
C of the single-linkage algorithm will be fully contained within some $C^*_i$ of the target
clustering, since any merger of subsets of distinct target clusters would require $d_{min} \ge \sigma$.
Therefore, the single-linkage clusters are indeed the target clusters. ∎
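The equivalence with Kruskal's algorithm suggests a short implementation: sort all pairs by distance and merge with a union-find structure until k components remain. The sketch below is illustrative (names are hypothetical) and uses all pairwise distances, so it is suitable only for small inputs.

```python
def single_linkage(points, k, dist):
    """Kruskal-style single linkage; returns k clusters as sorted index lists."""
    n = len(points)
    parent = list(range(n))

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # All pairs in increasing order of distance (ties broken by index).
    pairs = sorted((dist(points[i], points[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    clusters = n
    for _, i, j in pairs:
        if clusters <= k:
            break                      # halt when k trees remain
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj            # merge the two closest clusters
            clusters -= 1
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Illustrative 1-D data with gaps well above the within-group spacing.
points = [0, 1, 2, 10, 11, 12, 30]
result = single_linkage(points, k=3, dist=lambda x, y: abs(x - y))
```

In this example the conditions of the theorem hold with, for instance, a separation distance of 2: points in different groups are at least 8 apart, while every split of a group has points at distance 1 across it.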





35 Other agglomerative algorithms include complete linkage, which merges the two clusters whose
maximum distance between points is smallest, and Ward's algorithm described earlier, which merges the two
clusters that cause the k-means cost to increase by the least.

7.7.2 Robust Linkage


The single-linkage algorithm is fairly brittle. A few points bridging the gap between 
two different clusters can cause it to do the wrong thing. As a result, there has been 
significant work developing more robust versions of the algorithm. 


One commonly used robust version of single linkage is Wishart’s algorithm. A ball 
of radius r is created for each point with the point as center. The radius r is gradually 
increased starting from r = 0. The algorithm has a parameter t. When a ball has t or 
more points the center point becomes active. When two balls with active centers intersect 
the two center points are connected by an edge. The parameter t prevents a thin string 
of points between two clusters from causing a spurious merger. Note that Wishart’s al- 
gorithm with t = 1 is the same as single linkage. 
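The effect of the parameter t can be seen in a snapshot of the procedure at one fixed radius. The sketch below does not grow r gradually as the description does; it evaluates a single value of r, which is a simplification made for illustration, and the function name and data are hypothetical.

```python
def wishart_components(points, r, t, dist):
    """Connected groups of active centres at a single radius r.

    A point is active when the ball of radius r around it holds at least t
    points (itself included); two active centres are joined when their balls
    intersect, i.e. when they are within distance 2r of each other.
    """
    n = len(points)
    active = [i for i in range(n)
              if sum(1 for j in range(n) if dist(points[i], points[j]) <= r) >= t]
    parent = {i: i for i in active}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a in range(len(active)):
        for b in range(a + 1, len(active)):
            i, j = active[a], active[b]
            if dist(points[i], points[j]) <= 2 * r:
                parent[find(i)] = find(j)
    groups = {}
    for i in active:
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Two dense groups (indices 0-4 and 14-18) linked by a thin string of points.
points = [0, 1, 2, 3, 4] + list(range(14, 95, 10)) + [100, 101, 102, 103, 104]
d = lambda x, y: abs(x - y)
merged = wishart_components(points, r=5, t=1, dist=d)
separated = wishart_components(points, r=5, t=3, dist=d)
```

With t = 1 every point is active and the string of intermediate points chains the two groups together; with t = 3 the string points never become active, so the two dense groups stay apart.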


In fact, if one slightly modifies the algorithm to define a point to be live if its ball
of radius r/2 contains at least t points, then it is known [CD10] that a value of t =
O(d log n) is sufficient to recover a nearly correct solution under a natural distributional
formulation of the clustering problem. Specifically, suppose data points are drawn from
some probability distribution D over $R^d$, and that the clusters correspond to high-density
regions surrounded by lower-density moats. More specifically, the assumption is that


1. for some distance $\sigma > 0$, the $\sigma$-interior of each target cluster $C^*_i$ has density at least
some quantity $\lambda$ (the $\sigma$-interior is the set of all points at distance at least $\sigma$ from
the boundary of the cluster),


2. the region between target clusters has density less than $\lambda(1 - \varepsilon)$ for some $\varepsilon > 0$,
3. the clusters should be separated by distance greater than $2\sigma$, and
4. the $\sigma$-interior of the clusters contains most of their probability mass.


Then, for sufficiently large n, the algorithm will with high probability find nearly correct 
clusters. In this formulation, we allow points in low-density regions that are not in any 
target clusters at all. For details, see [CD10]. 


Robust Median Neighborhood Linkage robustifies single linkage in a different way.
This algorithm guarantees that if it is possible to delete a small fraction of the data such
that for all remaining points x, most of their |C*(x)| nearest neighbors indeed belong to
their own cluster C*(x), then the hierarchy on clusters produced by the algorithm will
include a close approximation to the true clustering. We refer the reader to [BLG14] for
the algorithm and proof.


7.8 Kernel Methods 


Kernel methods combine aspects of both center-based and density-based clustering.
In center-based approaches like k-means or k-center, once the cluster centers are fixed, the
Voronoi diagram of the cluster centers determines which cluster each data point belongs
to. This implies that clusters are pairwise linearly separable.


If we believe that the true desired clusters may not be linearly separable, and yet we
wish to use a center-based method, then one approach, as in the chapter on learning,
is to use a kernel. Recall that a kernel function K(x, y) can be viewed as performing
an implicit mapping $\phi$ of the data into a possibly much higher dimensional space, and
then taking a dot-product in that space. That is, $K(x, y) = \phi(x) \cdot \phi(y)$. This is then
viewed as the affinity between points x and y. We can extract distances in this new
space using the equation $|z_1 - z_2|^2 = z_1 \cdot z_1 + z_2 \cdot z_2 - 2 z_1 \cdot z_2$, so in particular we have
$|\phi(x) - \phi(y)|^2 = K(x, x) + K(y, y) - 2K(x, y)$. We can then run a center-based clustering
algorithm on these new distances.


One popular kernel function to use is the Gaussian kernel. The Gaussian kernel uses 
an affinity measure that emphasizes closeness of points and drops off exponentially as the 
points get farther apart. Specifically, we define the affinity between points x and y by 


\[ K(x, y) = e^{-\frac{1}{2\sigma^2} \|x - y\|^2}. \]
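The distance identity for kernels is easy to evaluate without ever forming the implicit mapping. The sketch below assumes a Gaussian kernel with a width parameter called `sigma` (the parameter name and default value are assumptions for illustration) and also checks the identity against the plain dot-product kernel, for which the implicit mapping is the identity.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Affinity that decays exponentially with squared Euclidean distance."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2 * sigma ** 2))

def dot_kernel(x, y):
    """Plain dot product; its implicit mapping is the identity."""
    return sum(a * b for a, b in zip(x, y))

def kernel_distance_sq(K, x, y):
    """Squared distance in the implicit space: K(x,x) + K(y,y) - 2 K(x,y)."""
    return K(x, x) + K(y, y) - 2 * K(x, y)

x, y = (1.0, 2.0), (4.0, 6.0)

# With the dot-product kernel we recover the ordinary squared distance 3^2 + 4^2.
euclid_sq = kernel_distance_sq(dot_kernel, x, y)

# With the Gaussian kernel every point has norm 1 in the implicit space,
# so squared distances lie between 0 and 2.
gauss_sq = kernel_distance_sq(gaussian_kernel, x, y)
```

These squared distances can be fed to any center-based procedure that only needs pairwise distances. A procedure that needs explicit centers would require additional care, which is outside the scope of this sketch.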


Another way to use affinities is to put them in an affinity matrix, or weighted graph.
This graph can then be separated into clusters using a graph partitioning procedure such
as the one in the following section.


7.9 Recursive Clustering Based on Sparse Cuts 


We now consider the case that data are nodes in an undirected connected graph
G(V, E) where an edge indicates that the end point vertices are similar. Recursive
clustering starts with all vertices in one cluster and recursively splits a cluster into two parts
whenever there is a small number of edges from one part to the other part of the cluster.
Formally, for two disjoint sets S and T of vertices, define

\[ \Phi(S, T) = \frac{\text{Number of edges from } S \text{ to } T}{\text{Total number of edges incident to } S \text{ in } G}. \]

$\Phi(S, T)$ measures the relative strength of similarities between S and T. Let d(i) be the
degree of vertex i and for a subset S of vertices, let $d(S) = \sum_{i \in S} d(i)$. Let m be the total
number of edges in the graph. The following algorithm aims to cut only a small fraction
of the edges and to produce clusters that are internally consistent in that no subset of the
cluster has low similarity to the rest of the cluster.


Recursive Clustering: Select an appropriate value for $\varepsilon$. If a current cluster
W has a subset S with $d(S) \le \frac{1}{2} d(W)$ and $\Phi(S, W \setminus S) \le \varepsilon$, then split W
into two clusters S and $W \setminus S$. Repeat until no such split is possible.
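For very small graphs the splitting rule can be tried exhaustively. The sketch below searches all subsets of a cluster, which takes exponential time and is meant only to illustrate the rule. It assumes the denominator of the sparsity ratio is the sum of degrees d(S) (the quantity used in the degree condition), and all names are hypothetical.

```python
from itertools import combinations

def recursive_clustering(adj, eps):
    """Brute-force version of the splitting rule; adj maps vertex -> set of neighbours."""
    deg = {v: len(adj[v]) for v in adj}       # degrees in the whole graph G

    def find_split(W):
        d_w = sum(deg[v] for v in W)
        members = sorted(W)
        for size in range(1, len(members)):
            for subset in combinations(members, size):
                S = set(subset)
                d_s = sum(deg[v] for v in S)
                if d_s == 0 or 2 * d_s > d_w:
                    continue                  # S must be the smaller side by degree
                # Edges from S to the rest of the current cluster W.
                cut = sum(1 for u in S for v in adj[u] if v in W and v not in S)
                if cut <= eps * d_s:
                    return S
        return None

    pending, finished = [set(adj)], []
    while pending:
        W = pending.pop()
        S = find_split(W)
        if S is None:
            finished.append(W)
        else:
            pending.extend([S, W - S])
    return sorted(sorted(c) for c in finished)

# Illustrative graph: two triangles joined by the single edge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
clusters = recursive_clustering(adj, eps=0.2)
```

On this graph the only sparse cut is the bridge between the triangles (one edge against a degree sum of 7), so the procedure splits once and then stops, since every cut inside a triangle removes a large share of the incident edges.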


Theorem 7.9 At termination of Recursive Clustering, the total number of edges between
vertices in different clusters is at most $O(\varepsilon m \ln n)$.

Proof: Each edge between two different clusters at the end was deleted at some stage
by the algorithm. We will "charge" edge deletes to vertices and bound the total charge.
When the algorithm partitions a cluster W into S and $W \setminus S$ with $d(S) \le (1/2) d(W)$, each
$k \in S$ is charged $\frac{d(k)}{d(S)}$ times the number of edges being deleted. Since $\Phi(S, W \setminus S) \le \varepsilon$,
the charge added to each $k \in W$ is at most $\varepsilon d(k)$. A vertex is charged only when it is
in the smaller part, $d(S) \le d(W)/2$, of the cut. So between any two times it is charged,
d(W) is reduced by a factor of at least two and so a vertex can be charged at most
$\log_2 m \le O(\ln n)$ times, proving the theorem. ∎


Implementing the algorithm requires computing $\min_{S \subseteq W} \Phi(S, W \setminus S)$, which is an NP-hard
problem. So the algorithm cannot be implemented right away. Luckily, eigenvalues and
eigenvectors, which can be computed fast, give an approximate answer. The connection
between eigenvalues and sparsity, known as Cheeger's inequality, is deep with applications
to Markov chains among others. We do not discuss this here.


7.10 Dense Submatrices and Communities 


Represent n data points in d-space by the rows of an n × d matrix A. Assume that A
has all non-negative entries. Examples to keep in mind for this section are the document-term
matrix and the customer-product matrix. We address the question of how to define
and find efficiently a coherent large subset of rows. To this end, the matrix A can be
represented by a bipartite graph (Figure 7.5). One side has a vertex for each row and the
other side a vertex for each column. Between the vertex for row i and the vertex for
column j, there is an edge with weight $a_{ij}$.


We want a subset S of row vertices and a subset T of column vertices so that

\[ A(S, T) = \sum_{i \in S, j \in T} a_{ij} \]

is high. This simple definition is not good since A(S, T) will be maximized by taking
all rows and columns. We need a balancing criterion that ensures that A(S, T) is high
relative to the sizes of S and T. One possibility is to maximize $\frac{A(S, T)}{|S||T|}$. This is not a good
measure either, since it is maximized by the single edge of highest weight. The definition
we use is the following. Let A be a matrix with non-negative entries. For a subset S of
rows and a subset T of columns, the density d(S, T) of S and T is $d(S, T) = \frac{A(S, T)}{\sqrt{|S||T|}}$. The
density d(A) of A is defined as the maximum value of d(S, T) over all subsets of rows and
columns. This definition applies to bipartite as well as non-bipartite graphs.
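The difference between the normalizations can be seen on a tiny matrix by enumerating all row and column subsets. The sketch below is exponential in the matrix size and is meant only to illustrate the definition of density with the square-root normalization; the function names and the example matrix are assumptions.

```python
from itertools import combinations
import math

def density(A, S, T):
    """Sum of entries in the S x T submatrix divided by sqrt(|S| |T|)."""
    total = sum(A[i][j] for i in S for j in T)
    return total / math.sqrt(len(S) * len(T))

def max_density(A):
    """Exhaustive search for the densest (S, T); returns (value, S, T)."""
    n, d = len(A), len(A[0])
    best = (0.0, (), ())
    for rs in range(1, n + 1):
        for S in combinations(range(n), rs):
            for cs in range(1, d + 1):
                for T in combinations(range(d), cs):
                    value = density(A, S, T)
                    if value > best[0]:
                        best = (value, S, T)
    return best

# A 2 x 2 block of ones plus one isolated entry.
A = [[1, 1, 0],
     [1, 1, 0],
     [0, 0, 1]]
best_value, best_rows, best_cols = max_density(A)
whole_matrix = density(A, (0, 1, 2), (0, 1, 2))
single_entry = density(A, (2,), (2,))
```

The dense block wins under this measure, ahead of both the whole matrix and any single entry. With the plain sum the whole matrix would win, and with division by |S||T| a single entry would tie with the block, which is the motivation given in the text for the square-root normalization.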





One important case is when A's rows and columns both represent the same set and
$a_{ij}$ is the similarity between object i and object j. Here $d(S, S) = \frac{A(S, S)}{|S|}$. If A is an n × n
0-1 matrix, it can be thought of as the adjacency matrix of an undirected graph, and
d(S, S) is the average degree of a vertex in S. The subgraph of maximum average degree
in a graph can be found exactly by network flow techniques, as we will show in the next
section. We do not know an efficient (polynomial-time) algorithm for finding d(A) exactly
in general. However, we show that d(A) is within a $O(\log^2 n)$ factor of the top singular
value of A assuming $|a_{ij}| \le 1$ for all i and j. This is a theoretical result. The gap may be
much less than $O(\log^2 n)$ for many problems, making singular values and singular vectors
quite useful. Also, S and T with $d(S, T) \ge \Omega(d(A)/\log^2 n)$ can be found algorithmically.

Figure 7.5: Example of a bipartite graph.


Theorem 7.10 Let A be an n × d matrix with entries between 0 and 1. Then

\[ \sigma_1(A) \ge d(A) \ge \frac{\sigma_1(A)}{4 \log n \log d}. \]

Furthermore, subsets S and T satisfying $d(S, T) \ge \frac{\sigma_1(A)}{4 \log n \log d}$ may be found from the top
singular vector of A.


Proof: Let S and T be the subsets of rows and columns that achieve d(A) = d(S, T).
Consider an n-vector u that is $\frac{1}{\sqrt{|S|}}$ on S and 0 elsewhere and a d-vector v that is $\frac{1}{\sqrt{|T|}}$
on T and 0 elsewhere. Then,

\[ \sigma_1(A) \ge u^T A v = \sum_{ij} u_i v_j a_{ij} = \frac{A(S, T)}{\sqrt{|S||T|}} = d(S, T) = d(A), \]

establishing the first inequality.


To prove the second inequality, express $\sigma_1(A)$ in terms of the first left and right
singular vectors x and y.

\[ \sigma_1(A) = x^T A y = \sum_{i,j} x_i a_{ij} y_j, \qquad |x| = |y| = 1. \]


Since the entries of A are non-negative, the components of the first left and right singular
vectors must all be non-negative, that is, $x_i \ge 0$ and $y_j \ge 0$ for all i and j. To bound
$\sum_{i,j} x_i a_{ij} y_j$, break the summation into O(log n log d) parts. Each part corresponds to a
given $\alpha$ and $\beta$ and consists of all i such that $\alpha \le x_i < 2\alpha$ and all j such that $\beta \le y_j < 2\beta$.
The log n log d parts are defined by breaking the rows into log n blocks with $\alpha$ equal to
$\frac{1}{2}\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}}, \frac{2}{\sqrt{n}}, \frac{4}{\sqrt{n}}, \ldots, 1$ and by breaking the columns into log d blocks with $\beta$ equal
to $\frac{1}{2}\frac{1}{\sqrt{d}}, \frac{1}{\sqrt{d}}, \frac{2}{\sqrt{d}}, \frac{4}{\sqrt{d}}, \ldots, 1$. The i such that $x_i < \frac{1}{2\sqrt{n}}$ and the j such that $y_j < \frac{1}{2\sqrt{d}}$ will
be ignored at a loss of at most $\frac{1}{4}\sigma_1(A)$. Exercise 7.27 proves the loss is at most this amount.


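Since $\sum_i x_i^2 = 1$, the set $S = \{i \mid \alpha \le x_i < 2\alpha\}$ has $|S| \le \frac{1}{\alpha^2}$ and similarly,
$T = \{j \mid \beta \le y_j \le 2\beta\}$ has $|T| \le \frac{1}{\beta^2}$. Thus

\[ \sum_{i:\, \alpha \le x_i < 2\alpha} \; \sum_{j:\, \beta \le y_j < 2\beta} x_i y_j a_{ij} \le 4 \alpha \beta A(S, T) \le 4 \alpha \beta \, d(S, T) \sqrt{|S||T|} \le 4 d(S, T) \le 4 d(A). \]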


From this it follows that

\[ \sigma_1(A) \le 4 d(A) \log n \log d \]

or

\[ d(A) \ge \frac{\sigma_1(A)}{4 \log n \log d}, \]

proving the second inequality.


It is clear that for each of the values of $(\alpha, \beta)$, we can compute A(S, T) and d(S, T)
as above and taking the best of these d(S, T)'s gives us an algorithm as claimed in the
theorem. ∎


Note that in many cases, the non-zero values of $x_i$ and $y_j$ after zeroing out the low
entries will only go from $\frac{1}{2}\frac{1}{\sqrt{n}}$ to $\frac{c}{\sqrt{n}}$ for $x_i$ and $\frac{1}{2}\frac{1}{\sqrt{d}}$ to $\frac{c}{\sqrt{d}}$ for $y_j$, since the singular vectors
are likely to be balanced given that the $a_{ij}$ are all between 0 and 1. In this case, there will
be O(1) groups only and the log factors disappear.


Another measure of density is based on similarities. Recall that the similarity between
objects represented by vectors (rows of A) is defined by their dot products. Thus,
similarities are entries of the matrix $AA^T$. Define the average cohesion f(S) of a set S of rows
of A to be the sum of all pairwise dot products of rows in S divided by |S|. The average
cohesion of A is the maximum over all subsets of rows of the average cohesion of the subset.


Since the singular values of $AA^T$ are squares of singular values of A, we expect f(A)
to be related to $\sigma_1(A)^2$ and $d(A)^2$. Indeed it is. We state the following without proof.


Lemma 7.11 $d(A)^2 \le f(A) \le d(A) \log n$. Also, $\sigma_1(A)^2 \ge f(A) \ge \frac{c \, \sigma_1(A)^2}{\log n}$.


f(A) can be found exactly using flow techniques as we will see later.

7.11 Community Finding and Graph Partitioning


Assume that data are nodes in a possibly weighted graph where edges represent some
notion of affinity between their endpoints. In particular, let G = (V, E) be a weighted
graph. Given two sets of nodes S and T, define

\[ E(S, T) = \sum_{i \in S, \, j \in T} e_{ij}. \]

We then define the density of a set S to be

\[ d(S, S) = \frac{E(S, S)}{|S|}. \]

If G is an undirected graph, then d(S, S) can be viewed as the average degree in the 
vertex-induced subgraph over S. The set S of maximum density is therefore the subgraph 
of maximum average degree. Finding such a set can be viewed as finding a tight-knit 
community inside some network. In the next section, we describe an algorithm for finding 
such a set using network flow techniques. 

Flow Methods

Here we consider dense induced subgraphs of a graph. An
induced subgraph of a graph consists of a subset of the vertices of the graph along with
all edges of the graph that connect pairs of vertices in the subset of vertices. We show
that finding an induced subgraph with maximum average degree can be done by network
flow techniques. This is simply maximizing the density d(S, S) over all subsets S of the
graph. First consider the problem of finding a subset of vertices such that the induced
subgraph has average degree at least $\lambda$ for some parameter $\lambda$. Then do a binary search
on the value of $\lambda$ until the maximum $\lambda$ for which there exists a subgraph with average
degree at least $\lambda$ is found.


Given a graph G in which one wants to find a dense subgraph, construct a directed 
graph H from the given graph and then carry out a flow computation on H. H has a 
node for each edge of the original graph, a node for each vertex of the original graph, 
plus two additional nodes s and t. There is a directed edge with capacity one from s to 
each node corresponding to an edge of the original graph and a directed edge with infinite 
capacity from each node corresponding to an edge of the original graph to the two nodes 
corresponding to the vertices the edge connects. Finally, there is a directed edge with 
capacity λ from each node corresponding to a vertex of the original graph to t.
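
The construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `build_network`, `max_flow_source_side`, and `densest_subgraph` are hypothetical, the max-flow routine is a plain Edmonds-Karp search chosen for brevity (the text does not prescribe a particular flow algorithm), the graph is unweighted with at least two vertices and one edge, and density is measured as edges per vertex, which is half the average degree.

```python
from collections import deque

INF = float("inf")

def build_network(n, edges, lam):
    """Directed graph H as nested dicts cap[u][v]; reverse arcs start at 0."""
    cap = {}
    def add_arc(u, v, c):
        cap.setdefault(u, {})[v] = c
        cap.setdefault(v, {}).setdefault(u, 0.0)
    for i, (a, b) in enumerate(edges):
        add_arc("s", ("e", i), 1.0)        # s -> edge node, capacity one
        add_arc(("e", i), ("v", a), INF)   # edge node -> its two endpoints
        add_arc(("e", i), ("v", b), INF)
    for u in range(n):
        add_arc(("v", u), "t", lam)        # vertex node -> t, capacity lambda
    return cap

def max_flow_source_side(cap, s, t):
    """Edmonds-Karp; returns the nodes still reachable from s in the final
    residual graph, i.e. the source side of a minimum cut."""
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return set(parent)
        path, v = [], t
        while parent[v] is not None:       # walk back from t to s
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= push
            cap[v][u] += push

def densest_subgraph(n, edges):
    """Binary search on lambda; assumes n >= 2 and at least one edge."""
    lo, hi = 0.0, float(len(edges))
    best = set(range(n))
    gap = 1.0 / (n * (n - 1))              # distinct densities differ by at least this
    while hi - lo >= gap:
        lam = (lo + hi) / 2
        side = max_flow_source_side(build_network(n, edges, lam), "s", "t")
        verts = {x[1] for x in side if isinstance(x, tuple) and x[0] == "v"}
        if verts:                          # non-trivial cut: some subgraph beats lam
            lo, best = lam, verts
        else:
            hi = lam
    return best

# A 4-clique on vertices 0..3 with a path 3-4-5 attached.
example_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
print(sorted(densest_subgraph(6, example_edges)))
```

The vertex nodes left on the source side of the minimum cut form the candidate subgraph, so a non-empty source side signals that some subgraph has more than λ edges per vertex.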


Notice there are three types of cut sets of the directed graph that have finite capacity,
Figure 7.6. The first cuts all arcs from the source. It has capacity e, the number of edges
of the original graph. The second cuts all arcs into the sink. It has capacity λv, where v
is the number of vertices of the original graph. The third cuts some arcs from s and some
arcs into t. It partitions the set of vertices and the set of edges of the original graph into
two blocks. The first block contains the source node s, a subset of the edges $e_s$, and a
subset of the vertices $v_s$ defined by the subset of edges. The first block must contain both
end points of each edge in $e_s$; otherwise an infinite arc will be in the cut. The second block
contains t and the remaining edges and vertices. The edges in this second block either
connect vertices in the second block or have one endpoint in each block. The cut set will
cut some infinite arcs from edges not in $e_s$ coming into vertices in $v_s$. However, these
arcs are directed from nodes in the block containing t to nodes in the block containing s.
Note that any finite capacity cut that leaves an edge node connected to s must cut the
two related vertex nodes from t, Figure 7.6. Thus, there is a cut of capacity $e - e_s + \lambda v_s$,
where $v_s$ and $e_s$ are the vertices and edges of a subgraph. For this cut to be the minimal
cut, the quantity $e - e_s + \lambda v_s$ must be minimal over all subsets of vertices of the original
graph and the capacity must be less than e and also less than λv.


Figure 7.6: The directed graph H used by the flow technique to find a dense subgraph


If there is a subgraph with $v_s$ vertices and $e_s$ edges where the ratio $\frac{e_s}{v_s}$ is sufficiently
large so that $\frac{e_s}{v_s} > \frac{e}{v}$, then for λ such that $\frac{e_s}{v_s} > \lambda > \frac{e}{v}$, $e_s - \lambda v_s > 0$ and $e - e_s + \lambda v_s < e$.
Similarly $e < \lambda v$ and thus $e - e_s + \lambda v_s < \lambda v$. This implies that the cut $e - e_s + \lambda v_s$ is less
than either e or λv and the flow algorithm will find a non-trivial cut and hence a proper
subset. For different values of λ in the above range there may be different non-trivial cuts.


Note that for a given density of edges, the number of edges grows as the square of the
number of vertices and $\frac{e_s}{v_s}$ is less likely to exceed $\frac{e}{v}$ if $v_s$ is small. Thus, the flow method
works well in finding large subsets since it works with $\frac{e_s}{v_s}$. To find small communities one
would need to use a method that worked with $\frac{e_s}{v_s^2}$ as the following example illustrates.


Figure 7.7: Cut in flow graph


Example: Consider finding a dense subgraph of 1,000 vertices and 2,000 internal edges in
a graph of $10^6$ vertices and $6\times 10^6$ edges. For concreteness, assume the graph was generated
by the following process. First, a 1,000-vertex graph with 2,000 edges was generated as a
random regular degree four graph. The 1,000-vertex graph was then augmented to have
$10^6$ vertices and edges were added at random until all vertices were of degree 12. Note
that each vertex among the first 1,000 has four edges to other vertices among the first
1,000 and eight edges to other vertices. The graph on the 1,000 vertices is much denser
than the whole graph in some sense. Although the subgraph induced by the 1,000 vertices
has four edges per vertex and the full graph has twelve edges per vertex, the probability
of two vertices of the 1,000 being connected by an edge is much higher than for the graph
as a whole. The probability is given by the ratio of the actual number of edges connecting
vertices among the 1,000 to the number of possible edges if the vertices formed a complete
graph.

$$p = \frac{e}{\binom{v}{2}} = \frac{2e}{v(v-1)}$$

For the 1,000 vertices, this number is $p = \frac{2\times 2{,}000}{1{,}000\times 999} \cong 4\times 10^{-3}$. For the entire graph this
number is $p = \frac{2\times 6\times 10^6}{10^6\times 10^6} = 12\times 10^{-6}$. This difference in probability of two vertices being
connected should allow us to find the dense subgraph. ■
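
The two probabilities in the example are easy to recompute. The snippet below is a small arithmetic check, not part of the original text; the helper name `edge_probability` is made up for illustration.

```python
def edge_probability(num_vertices, num_edges):
    """Fraction of the possible vertex pairs that are joined by an edge: 2e / (v(v-1))."""
    return 2 * num_edges / (num_vertices * (num_vertices - 1))

p_community = edge_probability(1_000, 2_000)       # the planted 1,000-vertex subgraph
p_whole = edge_probability(10**6, 6 * 10**6)        # the entire graph

print(p_community, p_whole, p_community / p_whole)
```

The ratio of the two values shows how much more likely an edge is inside the planted subgraph, even though its average degree is lower than that of the whole graph.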


In our example, the cut of all arcs out of s is of capacity $6\times 10^6$, the total number
of edges in the graph, and the cut of all arcs into t is of capacity λ times the number
of vertices or $\lambda\times 10^6$. A cut separating the 1,000 vertices and 2,000 edges would have
capacity $6\times 10^6 - 2{,}000 + \lambda\times 1{,}000$. This cut cannot be the minimum cut for any value
of λ since $\frac{e_s}{v_s} = 2$ and $\frac{e}{v} = 6$, hence $\frac{e_s}{v_s} < \frac{e}{v}$. The point is that to find the 1,000 vertices, we
have to maximize $A(S,S)/|S|^2$ rather than $A(S,S)/|S|$. Note that $A(S,S)/|S|^2$ penalizes
large |S| much more and therefore can find the 1,000 node “dense” subgraph.
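
As a quick sanity check of this argument, the three cut capacities can be compared for a range of λ values. This is a sketch using the numbers of the example; the function names and the grid of λ values are assumptions made for illustration.

```python
E, V = 6 * 10**6, 10**6        # edges and vertices of the whole graph
E_S, V_S = 2_000, 1_000        # edges and vertices of the planted community

def cut_capacities(lam):
    source_cut = E                          # all arcs out of s
    sink_cut = lam * V                      # all arcs into t
    community_cut = E - E_S + lam * V_S     # cut isolating the community
    return source_cut, sink_cut, community_cut

def community_cut_is_minimum(lam):
    source_cut, sink_cut, community_cut = cut_capacities(lam)
    return community_cut < min(source_cut, sink_cut)

# Try lambda = 0.0, 0.1, ..., 20.0: the community cut never wins.
never_minimum = not any(community_cut_is_minimum(step / 10) for step in range(201))
print(never_minimum)

# The squared normalisation, by contrast, favours the small community.
print(E_S / V_S, E / V)            # edges per vertex
print(E_S / V_S**2, E / V**2)      # edges per vertex squared
```

The community cut beats the source cut only for λ below 2 and beats the sink cut only for λ above roughly 6, so no single λ makes it the minimum.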

Figure 7.8: Illustration of spectral clustering.


7.12 Spectral Clustering Applied to Social Networks 


Finding communities in social networks is different from other clustering for several
reasons. First, we often want to find communities of size, say, 20 to 50 in networks with
100 million vertices. Second, a person is in a number of overlapping communities and thus
we are not finding disjoint clusters. Third, there often are a number of levels of structure
and a set of dominant communities may be hiding a set of weaker communities that are
of interest. Spectral clustering is one approach to these issues.


In spectral clustering of the vertices of a graph, one creates a matrix V whose columns 
correspond to the first k singular vectors of the adjacency matrix. Each row of V is 
the projection of a row of the adjacency matrix to the space spanned by the k singular 
vectors. In the example below, the graph has five vertices divided into two cliques, one 
consisting of the first three vertices and the other the last two vertices. The top two right 
singular vectors of the adjacency matrix, not normalized to length one, are $(1,1,1,0,0)^T$
and $(0,0,0,1,1)^T$. The five rows of the adjacency matrix projected to these vectors form
the 5 x 2 matrix in Figure 7.8. Here, there are two ideal clusters with all edges inside a 
cluster being present including self-loops and all edges between clusters being absent. The 
five rows project to just two points, depending on which cluster the rows are in. If the 
clusters were not so ideal and instead of the graph consisting of two disconnected cliques, 
the graph consisted of two dense subsets of vertices where the two sets were connected by 
only a few edges, then the singular vectors would not be indicator vectors for the clusters 
but close to indicator vectors. The rows would be mapped to two clusters of points instead 
of two points. A k-means clustering algorithm would find the clusters. 
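
The five-vertex example can be reproduced directly. The sketch below uses plain Python lists rather than a linear algebra library; `mat_mul` is a hypothetical helper, and the adjacency matrix includes self-loops as the text assumes.

```python
def mat_mul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]

# Two cliques with self-loops: vertices {0, 1, 2} and {3, 4}.
A = [[1, 1, 1, 0, 0],
     [1, 1, 1, 0, 0],
     [1, 1, 1, 0, 0],
     [0, 0, 0, 1, 1],
     [0, 0, 0, 1, 1]]

# Columns of V are the two singular vectors, not normalized to length one.
V = [[1, 0],
     [1, 0],
     [1, 0],
     [0, 1],
     [0, 1]]

projected = mat_mul(A, V)                       # one row per vertex
distinct_points = sorted(set(map(tuple, projected)))
print(projected)
print(distinct_points)
```

Each row of `projected` depends only on which clique the vertex belongs to, which is why the five rows collapse to two points.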


If the clusters were overlapping, then instead of two clusters of points, there would be 
three clusters of points where the third cluster corresponds to the overlapping vertices of 
the two clusters. Instead of using k-means clustering, we might instead find the minimum 
1-norm vector in the space spanned by the two singular vectors. The minimum 1-norm
vector will not be an indicator vector, so we would threshold its values to create an 
indicator vector for a cluster. Instead of finding the minimum 1-norm vector in the space 
spanned by the singular vectors in V, we might look for a small 1-norm vector close to 
the subspace. 

$$\min\left(1 - |x|_1 + \alpha\cos(\theta)\right)$$

Here θ is the angle between x and the space spanned by the two singular
vectors. α is a control parameter that determines how close we want the vector to be to
the subspace. When α is large, x must be close to the subspace. When α is zero, x can
be anywhere.


Finding the minimum 1-norm vector in the space spanned by a set of vectors can be 
formulated as a linear programming problem. To find the minimum 1-norm vector in V, 
write Vx = y where we want to solve for both x and y. Note that the format is different 
from the usual format for a set of linear equations Ax = b where b is a known vector. 


Finding the minimum 1-norm vector looks like a non-linear problem.

$$\min |y|_1 \quad \text{subject to } Vx = y$$

To remove the absolute value sign, write $y = y_1 - y_2$ with $y_1 \ge 0$ and $y_2 \ge 0$. Then solve

$$\min \left( \sum_{i=1}^{n} y_{1i} + \sum_{i=1}^{n} y_{2i} \right) \quad \text{subject to } Vx = y,\ y_1 \ge 0, \text{ and } y_2 \ge 0.$$

Write $Vx = y_1 - y_2$ as $Vx - y_1 + y_2 = 0$. Then we have the linear equations in a format
we are accustomed to.

$$[V, -I, I] \begin{pmatrix} x \\ y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}$$

This is a linear programming problem. The solution, however, happens to be x = 0,
$y_1 = 0$, and $y_2 = 0$. To resolve this, add the equation $y_{1i} = 1$ to get a community
containing the vertex i.


Often we are looking for communities of 50 or 100 vertices in graphs with hundreds of
millions of vertices. We want a method to find such communities in time proportional to
the size of the community and not the size of the entire graph. Here spectral clustering
can be used but instead of calculating singular vectors of the entire graph, we do something
else. Consider a random walk on a graph. If we walk long enough the probability
distribution converges to the first eigenvector. However, if we take only a few steps from a
start vertex or small group of vertices that we believe define a cluster, the probability will
distribute over the cluster with some of the probability leaking out to the remainder of
the graph. To get the early convergence of several vectors that ultimately converge to the
first few singular vectors, take a subspace $[x, Ax, A^2x, A^3x]$ and propagate the subspace.
At each iteration find an orthonormal basis and then multiply each basis vector by A.
Then take the resulting basis vectors after a few steps, say five, and find a minimum
1-norm vector in the subspace.
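
The subspace propagation step can be sketched as follows. This is an illustration only: the helper names are hypothetical, the orthonormal basis is computed with modified Gram-Schmidt (the text does not say which method to use), dense lists stand in for a sparse adjacency structure, and the final minimum 1-norm step is omitted because it needs a linear programming solver.

```python
import math

def mat_vec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormalize(vectors, tol=1e-10):
    """Modified Gram-Schmidt; numerically dependent vectors are dropped."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(q, w)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis

def propagate_subspace(A, x, steps=5):
    """Start from [x, Ax, A^2 x, A^3 x]; then repeatedly multiply an
    orthonormal basis by A and re-orthonormalize."""
    block = [x]
    for _ in range(3):
        block.append(mat_vec(A, block[-1]))
    basis = orthonormalize(block)
    for _ in range(steps):
        basis = orthonormalize([mat_vec(A, q) for q in basis])
    return basis

def two_cliques(size):
    """Adjacency matrix of two cliques joined by a single edge."""
    n = 2 * size
    A = [[0] * n for _ in range(n)]
    for block in (range(size), range(size, n)):
        for i in block:
            for j in block:
                if i != j:
                    A[i][j] = 1
    A[size - 1][size] = A[size][size - 1] = 1
    return A

A = two_cliques(4)
start = [1.0] + [0.0] * 7          # all weight on vertex 0
basis = propagate_subspace(A, start, steps=5)
print(len(basis))
```

On a large graph only the vertices reached within a few steps of the start set carry non-zero entries, which is what keeps the work proportional to the community size when the vectors are stored sparsely.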

A third issue that arises is when a dominant structure hides an important weaker
structure. One can run their algorithm to find the dominant structure and then weaken the
dominant structure by randomly removing edges in the clusters so that the edge density is
similar to the remainder of the network. Then reapplying the algorithm often will uncover
weaker structure. Real networks often have several levels of structure. The technique
can also be used to improve state-of-the-art clustering algorithms. After weakening the
dominant structure to find the weaker hidden structure, one can go back to the original data
and weaken the hidden structure and reapply the algorithm to again find the dominant
structure. This improves most state-of-the-art clustering algorithms.
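
The weakening step might look like the sketch below. It assumes the dominant clustering is already known and given as a list `cluster_of` mapping each vertex to a cluster label; both function names are hypothetical, and matching the inside density to the between-cluster density is one reading of "similar to the remainder of the network".

```python
import random

def keep_probability(edges, cluster_of):
    """Probability of keeping an intra-cluster edge so that the density
    inside clusters roughly matches the density between clusters."""
    n = len(cluster_of)
    sizes = {}
    for c in cluster_of:
        sizes[c] = sizes.get(c, 0) + 1
    pairs_inside = sum(s * (s - 1) // 2 for s in sizes.values())
    pairs_between = n * (n - 1) // 2 - pairs_inside
    inside = sum(1 for u, v in edges if cluster_of[u] == cluster_of[v])
    between = len(edges) - inside
    if inside == 0 or pairs_between == 0:
        return 1.0
    density_inside = inside / pairs_inside
    density_between = between / pairs_between
    return min(1.0, density_between / density_inside)

def weaken_clusters(edges, cluster_of, seed=0):
    """Randomly remove intra-cluster edges; edges between clusters are kept."""
    rng = random.Random(seed)
    keep = keep_probability(edges, cluster_of)
    kept = []
    for u, v in edges:
        same_cluster = cluster_of[u] == cluster_of[v]
        if not same_cluster or rng.random() < keep:
            kept.append((u, v))
    return kept

# Two 4-cliques joined by two edges.
cluster_of = [0, 0, 0, 0, 1, 1, 1, 1]
edges = [(u, v) for u in range(8) for v in range(u + 1, 8)
         if cluster_of[u] == cluster_of[v]] + [(0, 4), (1, 5)]
weakened = weaken_clusters(edges, cluster_of)
print(keep_probability(edges, cluster_of), len(edges), len(weakened))
```

After this step the clustering algorithm would be run again on the weakened edge list to look for the hidden structure.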

7.13 Bibliographic Notes


Clustering has a long history. For a good general survey, see [Jai10]. For a collection


7.14 Exercises


Exercise 7.1 Construct examples where using distances instead of distance squared gives 
bad results for Gaussian densities. For example, pick samples from two 1-dimensional 
unit variance Gaussians, with their centers 10 units apart. Cluster these samples by trial 
and error into two clusters, first according to k-means and then according to the k-median 
criteria. The k-means clustering should essentially yield the centers of the Gaussians as 
cluster centers. What cluster centers do you get when you use the k-median criterion? 


Exercise 7.2 Let v = (1, 3). What is the $L_1$ norm of v? The $L_2$ norm? The square of
the $L_1$ norm?


Exercise 7.3 Show that in 1-dimension, the center of a cluster that minimizes the sum 
of distances of data points to the center is in general not unique. Suppose we now require 
the center also to be a data point; then show that it is the median element (not the mean). 
Further in 1-dimension, show that if the center minimizes the sum of squared distances 
to the data points, then it is unique. 


Exercise 7.4 Construct a block diagonal matrix A with three blocks of size 50. Each 
matrix element in a block has value p = 0.7 and each matrix element not in a block has 
value q = 0.3. Generate a 150 × 150 matrix B of random numbers in the range [0,1]. If
$b_{ij} > a_{ij}$ replace $a_{ij}$ with the value one. Otherwise replace $a_{ij}$ with value zero. The rows
of A have three natural clusters. Generate a random permutation and use it to permute 
the rows and columns of the matrix A so that the rows and columns of each cluster are 
randomly distributed. 


1. Apply the k-means algorithm to A with k = 3. Do you find the correct clusters? 


2. Apply the k-means algorithm to A for 1 < k < 10. Plot the value of the sum of 
squares to the cluster centers versus k. Was three the correct value for k? 


Exercise 7.5 Let M be a k × k matrix whose elements are numbers in the range [0,1].
A matrix entry close to one indicates that the row and column of the entry correspond to 
closely related items and an entry close to zero indicates unrelated entities. Develop an 
algorithm to match each row with a closely related column where a column can be matched 
with only one row. 


Exercise 7.6 The simple greedy algorithm of Section 7.3 assumes that we know the clustering
radius r. Suppose we do not. Describe how we might arrive at the correct r.


Exercise 7.7 For the k-median problem, show that there is at most a factor of two ratio 
between the optimal value when we either require all cluster centers to be data points or 
allow arbitrary points to be centers. 


Exercise 7.8 For the k-means problem, show that there is at most a factor of four ratio
between the optimal value when we either require all cluster centers to be data points or
allow arbitrary points to be centers.


Exercise 7.9 Consider clustering points in the plane according to the k-median criterion,
where cluster centers are required to be data points. Enumerate all possible clusterings
and select the one with the minimum cost. The number of possible ways of labeling n
points, each with a label from {1,2,...,k} is $k^n$ which is prohibitive. Show that we can
find the optimal clustering in time at most a constant times $\binom{n}{k} + k^2$. Note that $\binom{n}{k} \le n^k$
which is much smaller than $k^n$ when k << n.


Exercise 7.10 Suppose in the previous exercise, we allow any point in space (not neces- 
sarily data points) to be cluster centers. Show that the optimal clustering may be found 
in time at most a constant times $n^{2k^2}$.


Exercise 7.11 Corollary 7.2 shows that for a set of points $\{a_1, a_2, \ldots, a_n\}$, there is a
unique point x, namely their centroid, which minimizes $\sum_{i=1}^{n} |a_i - x|^2$. Show examples
where the x minimizing $\sum_{i=1}^{n} |a_i - x|$ is not unique. (Consider just points on the real line.)
Show examples where the x defined as above are far apart from each other.


Exercise 7.12 Let $\{a_1, a_2, \ldots, a_n\}$ be a set of unit vectors in a cluster. Let $c = \frac{1}{n}\sum_{i=1}^{n} a_i$
be the cluster centroid. The centroid c is not in general a unit vector. Define the similarity
between two points $a_i$ and $a_j$ as their dot product. Show that the average cluster similarity
$\frac{1}{n^2}\sum_{i,j} a_i a_j^T$ is the same whether it is computed by averaging all pairs or computing the
average similarity of each point with the centroid of the cluster.

Exercise 7.13 For some synthetic data estimate the number of local minima for k-means
by using the birthday estimate. Is your estimate an unbiased estimate of the number? An
upper bound? A lower bound? Why?


Exercise 7.14 Examine the example in Figure 7.9 and discuss how to fix it. Optimizing 
according to the k-center or k-median criteria would seem to produce clustering B while 
clustering A seems more desirable. 


Exercise 7.15 Prove that for any two vectors a and b, $|a - b|^2 \ge \frac{1}{2}|a|^2 - |b|^2$.


Exercise 7.16 Let A be an n × d data matrix, B its best rank k approximation, and C the
optimal centers for k-means clustering of rows of A. How is it possible that $\|A - B\|_F^2 < \|A - C\|_F^2$?


Exercise 7.17 Suppose S is a finite set of points in space with centroid μ(S). If a set T
of points is added to S, show that the centroid μ(S ∪ T) of S ∪ T is at distance at most
$\frac{|T|}{|S|+|T|}\,|\mu(T) - \mu(S)|$ from μ(S).


Figure 7.9: insert caption


Exercise 7.18 What happens if we relax this restriction, for example, if we allow for S, 
the entire set? 


Exercise 7.19 Given the graph G = (V, E) of a social network where vertices represent 
individuals and edges represent relationships of some kind, one would like to define the 
concept of a community. A number of different definitions are possible. 


1. A subgraph $S = (V_S, E_S)$ whose density $\frac{E_S}{V_S^2}$ is greater than that of the graph, $\frac{E}{V^2}$.


2. A subgraph S with a low conductance like property such as the number of graph edges 
leaving the subgraph normalized by the minimum size of S or V — S where size is 
measured by the sum of degrees of vertices in S or in V — S. 


3. A subgraph that has more internal edges than in a random graph with the same 
degree distribution. 


Which would you use and why? 


Exercise 7.20 A stochastic matrix is a matrix with non-negative entries in which each
row sums to one. Show that for a stochastic matrix, the largest eigenvalue is one. Show 
that the eigenvalue has multiplicity one if and only if the corresponding Markov Chain is 
connected. 


Exercise 7.21 Show that if P is a stochastic matrix and π satisfies $\pi_i p_{ij} = \pi_j p_{ji}$, then
for any left eigenvector v of P, the vector u with components $u_i = \frac{v_i}{\pi_i}$ is a right eigenvector
with the same eigenvalue.


Exercise 7.22 Give an example of a clustering problem where the clusters are not linearly 
separable in the original space, but are separable in a higher dimensional space. 
Hint: Look at the example for Gaussian kernels in the chapter on learning. 


Exercise 7.23 The Gaussian kernel maps points to a higher dimensional space. What is
this mapping?


Exercise 7.24 Agglomerative clustering requires that one calculate the distances between
all pairs of points. If the number of points is a million or more, then this is impractical.
One might try speeding up the agglomerative clustering algorithm by maintaining 100
clusters at each unit of time. Start by randomly selecting a hundred points and place each
point in a cluster by itself. Each time a pair of clusters is merged, randomly select one of
the remaining data points and create a new cluster containing that point. Suggest some
other alternatives.

Exercise 7.25 Let A be the adjacency matrix of an undirected graph. Let $d(S,S) = \frac{A(S,S)}{|S|}$
be the density of the subgraph induced by the set of vertices S. Prove that d(S, S) is the
average degree of a vertex in S. Recall that $A(S,T) = \sum_{i \in S,\; j \in T} a_{ij}$.


Exercise 7.26 Suppose A is a matrix with non-negative entries. Show that $A(S,T)/(|S||T|)$
is maximized by the single edge with highest $a_{ij}$. Recall that $A(S,T) = \sum_{i \in S,\; j \in T} a_{ij}$.


Exercise 7.27 Suppose A is a matrix with non-negative entries and

$$\sigma_1(A) = x^T A y = \sum_{i,j} x_i a_{ij} y_j, \qquad |x| = |y| = 1.$$

Zero out all $x_i$ less than $1/(2\sqrt{n})$ and all $y_j$ less than $1/(2\sqrt{d})$. Show that the loss is no more
than 1/4th of $\sigma_1(A)$.


Exercise 7.28 Consider other measures of density such as $\frac{A(S,T)}{|S|^{\rho}|T|^{\rho}}$ for different values of
ρ. Discuss the significance of the densest subgraph according to these measures.


Exercise 7.29 Let A be the adjacency matrix of an undirected graph. Let M be the
matrix whose ij-th element is $a_{ij} - \frac{d_i d_j}{2m}$. Partition the vertices into two groups S and $\bar{S}$.
Let s be the indicator vector for the set S and let $\bar{s}$ be the indicator variable for $\bar{S}$. Then
$s^T M s$ is the number of edges in S above the expected number given the degree distribution
and $s^T M \bar{s}$ is the number of edges from S to $\bar{S}$ above the expected number given the degree
distribution. Prove that if $s^T M s$ is positive $s^T M \bar{s}$ must be negative.





Exercise 7.30 Which of the three axioms, scale invariance, richness, and consistency,
are satisfied by the following clustering algorithms?


1. k-means


2. Spectral Clustering.


Exercise 7.31 (Research Problem): What are good measures of density that are also
effectively computable? Is there empirical/theoretical evidence that some are better than
others?


Exercise 7.32 Create a graph with a small community and start a random walk in the
community. Calculate the frequency distribution over the vertices of the graph and normal- 
ize the frequency distribution by the stationary probability. Plot the ratio of the normalized 
frequency for the vertices of the graph. What is the shape of the plot for vertices in the 
small community? 


Exercise 7.33 


1. Create a random graph with the following two structures imbedded in it. The first 
structure has three equal size communities with no edges between communities and 
the second structure has five equal size communities with no edges between commu- 
nities. 


2. Apply a clustering algorithm to find the dominant structure. Which structure did
you get? 


3. Weaken the dominant structure by removing a fraction of its edges and see if you 
can find the hidden structure. 


Exercise 7.34 Experiment with finding hidden communities. 


Exercise 7.35 Generate a block model with two equal size communities where p and q
are the probabilities for the edge density within communities and the edge density between
communities. Then generate a G(n, p) model with probability (p + q)/2. Which of the two
models most likely generates a community with half of the vertices?


8 Random Graphs


Large graphs appear in many contexts such as the World Wide Web, the internet, 
social networks, journal citations, and other places. What is different about the modern 
study of large graphs from traditional graph theory and graph algorithms is that here 
one seeks statistical properties of these very large graphs rather than an exact answer 
to questions on specific graphs. This is akin to the switch physics made in the late 19th
century in going from mechanics to statistical mechanics. Just as the physicists did, one 
formulates abstract models of graphs that are not completely realistic in every situation, 
but admit a nice mathematical development that can guide what happens in practical 
situations. Perhaps the most basic model is the G (n,p) model of a random graph. In 
this chapter, we study properties of the G(n, p) model as well as other models. 


8.1 The G(n,p) Model 


The G(n, p) model, due to Erdős and Rényi, has two parameters, n and p. Here n is
the number of vertices of the graph and p is the edge probability. For each pair of distinct
vertices, v and w, p is the probability that the edge (v,w) is present. The presence of each
edge is statistically independent of all other edges. The graph-valued random variable
with these parameters is denoted by G(n, p). When we refer to “the graph G(n, p)”, we
mean one realization of the random variable. In many cases, p will be a function of n
such as p = d/n for some constant d. For example, if p = d/n then the expected degree
of a vertex of the graph is $(n-1)\frac{d}{n} \approx d$. In order to simplify calculations in this chapter,
we will often use the approximation that $\frac{n-1}{n} \approx 1$. In fact, conceptually it is helpful to
think of n as both the total number of vertices and as the number of potential neighbors
of any given node, even though the latter is really n − 1; for all our calculations, when n
is large, the correction is just a low-order term.


The interesting thing about the G(n,p) model is that even though edges are chosen 
independently with no “collusion”, certain global properties of the graph emerge from the 
independent choices. For small p, with p = d/n, d < 1, each connected component in the 
graph is small. For d > 1, there is a giant component consisting of a constant fraction of 
the vertices. In addition, there is a rapid transition at the threshold d = 1. Below the 
threshold, the probability of a giant component is very small, and above the threshold, 
the probability is almost one. 


The phase transition at the threshold d = 1 from very small o(n) size components to a
giant Ω(n) sized component is illustrated by the following example. Suppose the vertices
represent people and an edge means the two people it connects know each other. Given a
chain of connections, such as A knows B, B knows C, C knows D, ..., and Y knows Z, we
say that A indirectly knows Z. Thus, all people belonging to a connected component of
the graph indirectly know each other. Suppose each pair of people, independent of other
pairs, tosses a coin that comes up heads with probability p = d/n. If it is heads, they
know each other; if it comes up tails, they don't. The value of d can be interpreted as the
expected number of people a single person directly knows. The question arises as to how
large are sets of people who indirectly know each other?


Figure 8.1: Probability of a giant component as a function of the expected number of
people each person knows directly.


If the expected number of people each person knows is more than one, then a giant 
component of people, all of whom indirectly know each other, will be present consisting 
of a constant fraction of all the people. On the other hand, if in expectation, each person 
knows less than one person, the largest set of people who know each other indirectly is a 
vanishingly small fraction of the whole. Furthermore, the transition from the vanishing 
fraction to a constant fraction of the whole, happens abruptly between d slightly less than 
one to d slightly more than one. See Figure 8.1. Note that there is no global coordination 
of who knows whom. Each pair of individuals decides independently. Indeed, many large 
real-world graphs, with constant average degree, have a giant component. This is perhaps 
the most important global property of the G(n, p) model. 
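
The abrupt transition is easy to observe in a small simulation. The sketch below is an illustration under stated assumptions: the function name `largest_component_fraction`, the graph size, and the seed are arbitrary choices, and components are tracked with a simple union-find structure rather than anything prescribed by the text.

```python
import random

def largest_component_fraction(n, d, seed=0):
    """Sample G(n, d/n) and return the fraction of vertices in its largest component."""
    rng = random.Random(seed)
    p = d / n
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:            # each pair tosses its own coin
                parent[find(u)] = find(v)

    sizes = {}
    for u in range(n):
        root = find(u)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / n

below = largest_component_fraction(1000, 0.5)   # expected degree below one
above = largest_component_fraction(1000, 2.0)   # expected degree above one
print(below, above)
```

With expected degree 0.5 the largest component should contain only a small fraction of the vertices, while with expected degree 2 it should contain a large constant fraction.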


8.1.1 Degree Distribution 


One of the simplest quantities to observe in a real graph is the number of vertices 
of given degree, called the vertex degree distribution. It is also very simple to study 
these distributions in G (n, p) since the degree of each vertex is the sum of n independent 
random variables, which results in a binomial distribution. 


Example: In $G(n, \frac{1}{2})$, each vertex is of degree close to n/2. In fact, for any ε > 0, the
degree of each vertex almost surely is within 1 ± ε times n/2. To see this, note that the
degree of a vertex is the sum of n − 1 ≈ n indicator variables that take on value one or
zero depending on whether the edge is present or not, each of mean $\frac{1}{2}$ and variance $\frac{1}{4}$. The
expected value of the sum is the sum of the expected values and the variance of the sum
is the sum of the variances, and hence the degree has mean ≈ n/2 and variance ≈ n/4. Thus,
the probability mass is within an additive term of $\pm c\sqrt{n}$ of the mean for some constant
c and thus within a multiplicative factor of 1 ± ε of n/2 for sufficiently large n. ■


Figure 8.2: Two graphs, each with 40 vertices and 24 edges. The second graph was
randomly generated using the G(n, p) model with p = 1.2/n. A graph similar to the top
graph is almost surely not going to be randomly generated in the G(n, p) model, whereas
a graph similar to the lower graph will almost surely occur. Note that the lower graph
consists of a giant component along with a number of small components that are trees.


Figure 8.3: Illustration of the binomial and the power law distributions.


The degree distribution of G(n, p) for general p is also binomial. Since p is the probability
of an edge being present, the expected degree of a vertex is p(n − 1) ≈ pn. The
degree distribution is given by

$$\text{Prob(vertex has degree } k) = \binom{n-1}{k} p^k (1-p)^{n-k-1} \approx \binom{n}{k} p^k (1-p)^{n-k}.$$

The quantity $\binom{n}{k}$ is the number of ways of choosing k edges, out of the possible n edges,
and $p^k(1-p)^{n-k}$ is the probability that the k selected edges are present and the remaining
n − k are not.
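
The exact form of this distribution is straightforward to evaluate. The sketch below uses the n − 1 potential edges of a vertex rather than the approximation n; the function name `degree_pmf` and the parameters (40 vertices, p = 1.2/n, echoing Figure 8.2) are illustrative choices.

```python
from math import comb

def degree_pmf(n, p, k):
    """Probability that a fixed vertex of G(n, p) has degree exactly k."""
    return comb(n - 1, k) * p**k * (1 - p)**(n - 1 - k)

n, p = 40, 1.2 / 40
pmf = [degree_pmf(n, p, k) for k in range(n)]   # degrees 0 .. n-1
total = sum(pmf)
mean_degree = sum(k * q for k, q in enumerate(pmf))
print(total, mean_degree, p * (n - 1))
```

The probabilities sum to one and the mean matches p(n − 1), the expected degree stated above.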


The binomial distribution falls off exponentially fast as one moves away from the mean. 
However, the degree distributions of graphs that appear in many applications do not ex- 
hibit such sharp drops. Rather, the degree distributions are much broader. This is often 
referred to as having a “heavy tail”. The term tail refers to values of a random variable 
far away from its mean, usually measured in number of standard deviations. Thus, al- 
though the G (n, p) model is important mathematically, more complex models are needed 
to represent real world graphs. 


Consider an airline route graph. The graph has a wide range of degrees from degree 
one or two for a small city to degree 100 or more, for a major hub. The degree distribution 
is not binomial. Many large graphs that arise in various applications appear to have power 
law degree distributions. A power law degree distribution is one in which the number of 
vertices having a given degree decreases as a power of the degree, as in

$$\text{Number(degree } k \text{ vertices)} = c\,\frac{n}{k^r},$$


for some small positive real r, often just slightly less than three. Later, we will consider 
a random graph model giving rise to such degree distributions. 


The following theorem states that the degree distribution of the random graph G (n, p) 
is tightly concentrated about its expected value. That is, the probability that the degree 
of a vertex differs from its expected degree by more than $\lambda\sqrt{np}$ drops off exponentially
fast with λ.


Theorem 8.1 Let v be a vertex of the random graph G(n, p). Let α be a real number in
$(0, \sqrt{np})$.

$$\text{Prob}\left(|np - \deg(v)| \ge \alpha\sqrt{np}\right) \le 3e^{-\alpha^2/8}.$$


Proof: The degree deg(v) is the sum of n − 1 independent Bernoulli random variables,
$x_1, x_2, \ldots, x_{n-1}$, where $x_i$ is the indicator variable that the i-th edge from v is present. So,
approximating n − 1 with n, the theorem follows from Theorem 12.6 in the appendix. ■


Although the probability that the degree of a single vertex differs significantly from
its expected value drops exponentially, the statement that the degree of every vertex is
close to its expected value requires that p is $\Omega\left(\frac{\ln n}{n}\right)$. That is, the expected degree grows
at least logarithmically with the number of vertices.


Corollary 8.2 Suppose ε is a positive constant. If $p \ge \frac{9\ln n}{n\varepsilon^2}$, then almost surely every
vertex has degree in the range (1 − ε)np to (1 + ε)np.





Proof: Apply Theorem 8.1 with a = e,/np to get that the probability that an individual 
vertex has degree outside the range [(1 — e)np, (1 + e)np] is at most 3e7"?/8, By the 


union bound, the probability that some vertex has degree outside this range is at most 
3ne-* "8/8, For this to be o(1), it suffices for p > 2na, E 
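
As a quick numeric check of the union bound in this proof, the sketch below evaluates $3ne^{-\varepsilon^2 np/8}$ at the boundary value $p = 9\ln n/(n\varepsilon^2)$, where it simplifies to $3n^{-1/8}$. The function name `union_bound` is ours and is used only for illustration.

```python
import math

def union_bound(n: int, eps: float, p: float) -> float:
    """Upper bound 3*n*exp(-eps^2 * n * p / 8) on the probability that some
    vertex of G(n, p) has degree outside [(1-eps)np, (1+eps)np]."""
    return 3 * n * math.exp(-(eps ** 2) * n * p / 8)

if __name__ == "__main__":
    eps = 0.5
    for n in (10**4, 10**8, 10**16):
        p = 9 * math.log(n) / (n * eps ** 2)   # boundary case of Corollary 8.2
        # At this p the bound equals 3 * n**(-1/8), which tends to zero slowly.
        print(n, union_bound(n, eps, p), 3 * n ** (-1 / 8))
```

The bound does go to zero, but only at rate $n^{-1/8}$, so it is informative only for fairly large $n$.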


Note that the assumption $p$ is $\Omega\left(\frac{\log n}{n}\right)$ is necessary. If $p = d/n$ for $d$ a constant,
then some vertices may well have degrees outside the range $[(1-\varepsilon)d, (1+\varepsilon)d]$. Indeed,
shortly we will see that it is highly likely that for $p = \frac{1}{n}$ there is a vertex of degree
$\Omega(\log n/\log\log n)$. Moreover, for $p = \frac{1}{n}$ it is easy to see that with high probability there
will be at least one vertex of degree zero.





When $p$ is a constant, the expected degree of vertices in $G(n,p)$ increases with $n$. In
$G(n, \frac{1}{2})$ the expected degree of a vertex is approximately $n/2$. In many real applications,
we will be concerned with $G(n,p)$ where $p = d/n$, for $d$ a constant, i.e., graphs whose
expected degree is a constant $d$ independent of $n$. As $n$ goes to infinity, the binomial
distribution with $p = \frac{d}{n}$

$$\text{Prob}(k) = \binom{n}{k}\left(\frac{d}{n}\right)^{k}\left(1-\frac{d}{n}\right)^{n-k}$$

approaches the Poisson distribution

$$\text{Prob}(k) = \frac{d^k}{k!}e^{-d}.$$

To see this, assume $k = o(n)$ and use the approximations $\binom{n}{k} \cong \frac{n^k}{k!}$, $n - k \cong n$, and
$\left(1-\frac{d}{n}\right)^{n-k} \cong \left(1-\frac{d}{n}\right)^{n} \cong e^{-d}$. Then

$$\lim_{n\to\infty}\binom{n}{k}\left(\frac{d}{n}\right)^{k}\left(1-\frac{d}{n}\right)^{n-k} = \frac{n^k}{k!}\,\frac{d^k}{n^k}\,e^{-d} = \frac{d^k}{k!}e^{-d}.$$

Note that for $p = \frac{d}{n}$, where $d$ is a constant independent of $n$, the probability of the
binomial distribution falls off rapidly for $k > d$, and is essentially zero once $k!$ dominates
$d^k$. This justifies the $k = o(n)$ assumption. Thus, the Poisson distribution is a good
approximation.
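
The convergence can be checked numerically. The following is a minimal sketch, assuming nothing beyond the two formulas above; the helper names `binomial_pmf` and `poisson_pmf` are ours. It reports the largest pointwise gap between the two distributions over $k = 0, \ldots, 10$ for growing $n$.

```python
import math

def binomial_pmf(n: int, p: float, k: int) -> float:
    """Prob(k) for the binomial distribution with n trials and success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(d: float, k: int) -> float:
    """Prob(k) for the Poisson distribution with mean d."""
    return d ** k * math.exp(-d) / math.factorial(k)

def max_gap(n: int, d: float, k_max: int = 10) -> float:
    """Largest |binomial - Poisson| over k = 0..k_max, with p = d/n."""
    return max(abs(binomial_pmf(n, d / n, k) - poisson_pmf(d, k))
               for k in range(k_max + 1))

if __name__ == "__main__":
    for n in (10, 100, 10_000):
        print(n, max_gap(n, d=3.0))
```

The gap shrinks as $n$ grows with $d$ held fixed, in line with the limit computed above.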


Example: In $G(n, \frac{1}{n})$ many vertices are of degree one, but not all. Some are of degree
zero and some are of degree greater than one. In fact, it is highly likely that there is a
vertex of degree $\Omega(\log n/\log\log n)$. The probability that a given vertex is of degree $k$ is

$$\text{Prob}(k) = \binom{n-1}{k}\left(\frac{1}{n}\right)^{k}\left(1-\frac{1}{n}\right)^{n-1-k} \approx \frac{e^{-1}}{k!}.$$

If $k = \log n/\log\log n$,

$$\log k^k = k\log k = \frac{\log n}{\log\log n}\left(\log\log n - \log\log\log n\right) \le \log n$$

and thus $k^k \le n$. Since $k! \le k^k \le n$, the probability that a vertex has degree $k =
\log n/\log\log n$ is at least $\frac{1}{k!}e^{-1} \ge \frac{1}{en}$. If the degrees of vertices were independent random
variables, then this would be enough to argue that there would be a vertex of degree
$\log n/\log\log n$ with probability at least $1 - \left(1 - \frac{1}{en}\right)^{n} \cong 1 - e^{-\frac{1}{e}} \cong 0.31$. But the degrees
are not quite independent since when an edge is added to the graph it affects the degree
of two vertices. This is a minor technical point, which one can get around. ■


8.1.2 Existence of Triangles in G(n, d/n) 


What is the expected number of triangles in $G(n, \frac{d}{n})$, when $d$ is a constant? As the
number of vertices increases one might expect the number of triangles to increase, but this
is not the case. Although the number of triples of vertices grows as $n^3$, the probability
of an edge between two specific vertices decreases linearly with $n$. Thus, the probability
of all three edges between the pairs of vertices in a triple of vertices being present goes
down as $n^{-3}$, exactly canceling the rate of growth of triples.

A random graph with $n$ vertices and edge probability $d/n$ has an expected number
of triangles that is independent of $n$, namely $d^3/6$. There are $\binom{n}{3}$ triples of vertices.
Each triple has probability $\left(\frac{d}{n}\right)^3$ of being a triangle. Let $\Delta_{ijk}$ be the indicator variable
for the triangle with vertices $i$, $j$, and $k$ being present, that is, all three edges $(i,j)$,
$(j,k)$, and $(i,k)$ being present. Then the number of triangles is $x = \sum_{ijk}\Delta_{ijk}$. Even
though the existence of the triangles are not statistically independent events, by linearity
of expectation, which does not assume independence of the variables, the expected value
of a sum of random variables is the sum of the expected values. Thus, the expected
number of triangles is

$$E(x) = E\Big(\sum_{ijk}\Delta_{ijk}\Big) = \sum_{ijk}E(\Delta_{ijk}) = \binom{n}{3}\left(\frac{d}{n}\right)^{3} \approx \frac{d^3}{6}.$$
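
The claim that the expected triangle count does not depend on $n$ can be checked empirically. The sketch below is illustrative only: the function names, the seed, and the parameters $n = 200$, $d = 2$ are our choices. It compares the exact expectation $\binom{n}{3}(d/n)^3$ with an average over sampled graphs.

```python
import itertools
import math
import random

def expected_triangles(n: int, d: float) -> float:
    """Exact expectation C(n,3) * (d/n)^3 for G(n, d/n)."""
    return math.comb(n, 3) * (d / n) ** 3

def count_triangles(n: int, p: float, rng: random.Random) -> int:
    """Sample one graph from G(n, p) and count its triangles."""
    adj = [set() for _ in range(n)]
    for u, v in itertools.combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    # Each triangle {u < v < w} is counted once: at edge (u, v) with w > v.
    return sum(1 for u in range(n) for v in adj[u] if v > u
               for w in adj[u] & adj[v] if w > v)

if __name__ == "__main__":
    rng = random.Random(0)
    n, d, trials = 200, 2.0, 200
    mean = sum(count_triangles(n, d / n, rng) for _ in range(trials)) / trials
    print("exact:", expected_triangles(n, d), "limit d^3/6:", d ** 3 / 6, "sample mean:", mean)
```

The sample mean fluctuates around $d^3/6$; how closely it matches depends on the number of trials.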


Even though on average there are $\frac{d^3}{6}$ triangles per graph, this does not mean that with
high probability a graph has a triangle. Maybe half of the graphs have $\frac{d^3}{3}$ triangles and
the other half have none for an average of $\frac{d^3}{6}$ triangles. Then, with probability 1/2, a
graph selected at random would have no triangle. If $1/n$ of the graphs had $\frac{d^3}{6}n$ triangles
and the remaining graphs had no triangles, then as $n$ goes to infinity, the probability that
a graph selected at random would have a triangle would go to zero.


We wish to assert that with some non-zero probability there is at least one triangle
in $G(n,p)$ when $p = \frac{d}{n}$. If all the triangles were on a small number of graphs, then the
number of triangles in those graphs would far exceed the expected value and hence the
variance would be high. A second moment argument rules out this scenario where a small
fraction of graphs have a large number of triangles and the remaining graphs have none.


Let's calculate $E(x^2)$ where $x$ is the number of triangles. Write $x$ as $x = \sum_{ijk}\Delta_{ijk}$,
where $\Delta_{ijk}$ is the indicator variable of the triangle with vertices $i$, $j$, and $k$ being present.
Expanding the squared term

$$E(x^2) = E\Big(\sum_{i,j,k}\Delta_{ijk}\Big)^{2} = E\Big(\sum_{\substack{i,j,k \\ i',j',k'}}\Delta_{ijk}\Delta_{i'j'k'}\Big).$$

Split the above sum into three parts. In Part 1, let $S_1$ be the set of $i,j,k$ and $i',j',k'$
which share at most one vertex and hence the two triangles share no edge. In this case,
$\Delta_{ijk}$ and $\Delta_{i'j'k'}$ are independent and

$$E\Big(\sum_{S_1}\Delta_{ijk}\Delta_{i'j'k'}\Big) = \sum_{S_1}E(\Delta_{ijk})E(\Delta_{i'j'k'}) \le \Big(\sum_{\text{all } ijk}E(\Delta_{ijk})\Big)\Big(\sum_{\text{all } i'j'k'}E(\Delta_{i'j'k'})\Big) = E^2(x).$$


[Figure 8.4 panels: the two triangles of Part 1 are either disjoint or share at most one vertex;
the two triangles of Part 2 share an edge; the two triangles in Part 3 are the same triangle.]

Figure 8.4: The triangles in Part 1, Part 2, and Part 3 of the second moment argument
for the existence of triangles in $G(n, \frac{d}{n})$.


In Part 2, $i,j,k$ and $i',j',k'$ share two vertices and hence one edge. See Figure 8.4.
Four vertices and five edges are involved overall. There are at most $\binom{n}{4} \in O(n^4)$ 4-vertex
subsets and $\binom{4}{2}$ ways to partition the four vertices into two triangles with a common edge.
The probability of all five edges in the two triangles being present is $p^5$, so this part sums
to $O(n^4p^5) = O(d^5/n)$ and is $o(1)$. There are so few triangles in the graph that two
triangles sharing an edge is extremely unlikely.


In Part 3, $i,j,k$ and $i',j',k'$ are the same sets. The contribution of this part of the
summation to $E(x^2)$ is $E(x)$. Thus, putting all three parts together,

$$E(x^2) \le E^2(x) + E(x) + o(1),$$

which implies

$$\text{Var}(x) = E(x^2) - E^2(x) \le E(x) + o(1).$$

For $x$ to be equal to zero, it must differ from its expected value by at least its expected
value. Thus,

$$\text{Prob}(x = 0) \le \text{Prob}\left(|x - E(x)| \ge E(x)\right).$$

By Chebyshev's inequality,

$$\text{Prob}(x = 0) \le \frac{\text{Var}(x)}{E^2(x)} \le \frac{E(x) + o(1)}{E^2(x)} \le \frac{6}{d^3} + o(1). \qquad (8.1)$$

Thus, for $d > \sqrt[3]{6} \cong 1.8$, $\text{Prob}(x = 0) < 1$ and $G(n,p)$ has a triangle with non-zero
probability. For $d < \sqrt[3]{6}$, $E(x) \cong \frac{d^3}{6} < 1$, so the bound (8.1) exceeds one and gives no
information; there are too few edges in the graph for this argument to guarantee a triangle.
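
To make the role of the constant $\sqrt[3]{6}$ concrete, the short sketch below evaluates the leading term $6/d^3$ of the bound (8.1) for a few values of $d$. The function name `no_triangle_bound` is hypothetical, and the $o(1)$ term is ignored.

```python
def no_triangle_bound(d: float) -> float:
    """Leading term 6/d^3 of the Chebyshev bound (8.1) on Prob(x = 0),
    capped at 1 because a probability bound above 1 carries no information."""
    return min(1.0, 6.0 / d ** 3)

if __name__ == "__main__":
    print("cube root of 6:", 6 ** (1 / 3))
    for d in (1.0, 1.8, 2.0, 3.0, 5.0):
        print(d, no_triangle_bound(d))
```

The bound becomes nontrivial just above $d \approx 1.82$ and decays like $d^{-3}$ afterwards.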


8.2 Phase Transitions 


Many properties of random graphs undergo structural changes as the edge probability
passes some threshold value. This phenomenon is similar to the abrupt phase transitions in
physics, as the temperature or pressure increases. Some examples of this are the abrupt
appearance of cycles in $G(n,p)$ when $p$ reaches $1/n$ and the disappearance of isolated
vertices when $p$ reaches $\frac{\ln n}{n}$. The most important of these transitions is the emergence of
a giant component, a connected component of size $\Theta(n)$, which happens at $d = 1$. Recall
Figure 8.1.


Probability | Transition
$p = o(\frac{1}{n})$ | Forest of trees, no component of size greater than $O(\log n)$
$p = \frac{d}{n}$, $d < 1$ | Cycles appear, no component of size greater than $O(\log n)$
$p = \frac{d}{n}$, $d = 1$ | Components of size $O(n^{2/3})$
$p = \frac{d}{n}$, $d > 1$ | Giant component plus $O(\log n)$ components
$p = \frac{1}{2}\frac{\ln n}{n}$ | Giant component plus isolated vertices
$p = \frac{\ln n}{n}$ | Disappearance of isolated vertices; appearance of Hamilton circuit; diameter $O(\log n)$
$p = \sqrt{\frac{2\ln n}{n}}$ | Diameter two
$p = \frac{1}{2}$ | Clique of size $(2 - \varepsilon)\ln n$

Table 1: Phase transitions


For these and many other properties of random graphs, a threshold exists where an
abrupt transition from not having the property to having the property occurs. If there
exists a function $p(n)$ such that when $\lim_{n\to\infty}\frac{p_1(n)}{p(n)} = 0$, $G(n, p_1(n))$ almost surely does not
have the property, and when $\lim_{n\to\infty}\frac{p_2(n)}{p(n)} = \infty$, $G(n, p_2(n))$ almost surely has the property,
then we say that a phase transition occurs, and $p(n)$ is the threshold. Recall that $G(n,p)$
“almost surely does not have the property” means that the probability that it has the
property goes to zero in the limit, as $n$ goes to infinity. We shall soon see that every
increasing property has a threshold. This is true not only for increasing properties of
$G(n,p)$, but for increasing properties of any combinatorial structure. If for $cp(n)$, $c < 1$,
the graph almost surely does not have the property and for $cp(n)$, $c > 1$, the graph
almost surely has the property, then $p(n)$ is a sharp threshold. The existence of a giant
component has a sharp threshold at $1/n$. We will prove this later.


In establishing phase transitions, we often use a variable $x(n)$ to denote the number
of occurrences of an item in a random graph. If the expected value of $x(n)$ goes to zero as
$n$ goes to infinity, then a graph picked at random almost surely has no occurrence of the
item. This follows from Markov's inequality. Since $x$ is a non-negative random variable,
$\text{Prob}(x \ge a) \le \frac{1}{a}E(x)$, which implies that the probability of $x(n) \ge 1$ is at most $E(x(n))$.
That is, if the expected number of occurrences of an item in a graph goes to zero, the
probability that there are one or more occurrences of the item in a randomly selected
graph goes to zero. This is called the first moment method.


Figure 8.5: Figure 8.5(a) shows a phase transition at $p = \frac{1}{n}$. The dotted line shows
an abrupt transition in $\text{Prob}(x > 0)$ from 0 to 1. For any function asymptotically less than
$\frac{1}{n}$, $\text{Prob}(x > 0)$ is zero and for any function asymptotically greater than $\frac{1}{n}$, $\text{Prob}(x > 0)$ is
one. Figure 8.5(b) expands the scale and shows a less abrupt change in probability unless
the phase transition is sharp as illustrated by the dotted line. Figure 8.5(c) is a further
expansion and the sharp transition is now more smooth.


The previous section showed that the property of having a triangle has a threshold at
$p(n) = 1/n$. If the edge probability $p_1(n)$ is $o(1/n)$, then the expected number of triangles
goes to zero and by the first moment method, the graph almost surely has no triangle.
However, if the edge probability $p_2(n)$ satisfies $\frac{p_2(n)}{1/n} \to \infty$, then from (8.1), the probability
of having no triangle is at most $6/d^3 + o(1) = 6/(np_2(n))^3 + o(1)$, which goes to zero. This
latter case uses what we call the second moment method. The first and second moment
methods are broadly used. We describe the second moment method in some generality
now.


When the expected value of x(n), the number of occurrences of an item, goes to 
infinity, we cannot conclude that a graph picked at random will likely have a copy since 
the items may all appear on a vanishingly small fraction of the graphs. We resort to a 
technique called the second moment method. It is a simple idea based on Chebyshev’s 
inequality. 


Theorem 8.3 (Second Moment method) Let $x(n)$ be a random variable with $E(x) > 0$.
If

$$\text{Var}(x) = o\left(E^2(x)\right),$$

then $x$ is almost surely greater than zero.


Figure 8.6: If the expected fraction of the number of graphs in which an item occurs
did not go to zero, then $E(x)$, the expected number of items per graph, could not be
zero. Suppose 10% of the graphs had at least one occurrence of the item. Then the
expected number of occurrences per graph must be at least 0.1. Thus, $E(x) \to 0$ implies
the probability that a graph has an occurrence of the item goes to zero. However, the
other direction needs more work. If $E(x)$ is large, a second moment argument is needed
to conclude that the probability that a graph picked at random has an occurrence of the
item is non-negligible, since there could be a large number of occurrences concentrated on
a vanishingly small fraction of all graphs. The second moment argument claims that for
a non-negative random variable $x$ with $E(x) > 0$, if $\text{Var}(x)$ is $o(E^2(x))$ or alternatively
if $E(x^2) \le E^2(x)(1 + o(1))$, then almost surely $x > 0$.


Proof: If $E(x) > 0$, then for $x$ to be less than or equal to zero, it must differ from its
expected value by at least its expected value. Thus,

$$\text{Prob}(x \le 0) \le \text{Prob}\left(|x - E(x)| \ge E(x)\right).$$

By Chebyshev's inequality,

$$\text{Prob}\left(|x - E(x)| \ge E(x)\right) \le \frac{\text{Var}(x)}{E^2(x)} \to 0.$$

Thus, $\text{Prob}(x \le 0)$ goes to zero if $\text{Var}(x)$ is $o(E^2(x))$. ■

Corollary 8.4 Let $x$ be a random variable with $E(x) > 0$. If

$$E(x^2) \le E^2(x)(1 + o(1)),$$

then $x$ is almost surely greater than zero.

Proof: If $E(x^2) \le E^2(x)(1 + o(1))$, then

$$\text{Var}(x) = E(x^2) - E^2(x) \le E^2(x)\,o(1) = o(E^2(x)).$$

Second moment arguments are more difficult than first moment arguments since they
deal with variance and without independence we do not have $E(xy) = E(x)E(y)$. In the
triangle example, dependence occurs when two triangles share a common edge. However,
if $p = \frac{d}{n}$ there are so few triangles that almost surely no two triangles share a common
edge and the lack of statistical independence does not affect the answer. In looking for
a phase transition, almost always the transition in probability of an item being present
occurs when the expected number of items transitions.


Threshold for graph diameter two (two degrees of separation)


We now present the first example of a sharp phase transition for a property. This 
means that slightly increasing the edge probability p near the threshold takes us from 
almost surely not having the property to almost surely having it. The property is that 
of a random graph having diameter less than or equal to two. The diameter of a graph 
is the maximum length of the shortest path between a pair of nodes. In other words, the 
property is that every pair of nodes has “at most two degrees of separation”. 


The following technique for deriving the threshold for a graph having diameter two
is a standard method often used to determine the threshold for many other objects. Let
$x$ be a random variable for the number of objects such as triangles, isolated vertices, or
Hamiltonian circuits, for which we wish to determine a threshold. Then we determine
the value of $p$, say $p_0$, where the expected value of $x$ goes from vanishingly small to
unboundedly large. For $p < p_0$ almost surely a graph selected at random will not have a
copy of the item. For $p > p_0$, a second moment argument is needed to establish that the
items are not concentrated on a vanishingly small fraction of the graphs and that a graph
picked at random will almost surely have a copy.


Our first task is to figure out what to count to determine the threshold for a graph
having diameter two. A graph has diameter at most two if and only if for each pair of vertices $i$
and $j$, either there is an edge between them or there is another vertex $k$ to which both $i$
and $j$ have an edge. So, what we will count is the number of pairs $i$ and $j$ that fail, i.e.,
the number of pairs $i$ and $j$ that have more than two degrees of separation. The set of
neighbors of $i$ and the set of neighbors of $j$ are random subsets of expected cardinality
$np$. For these two sets to intersect requires $np \approx \sqrt{n}$ or $p \approx \frac{1}{\sqrt{n}}$. Such statements often
go under the general name of “birthday paradox” though it is not a paradox. In what
follows, we will prove a threshold of $O(\sqrt{\ln n}/\sqrt{n})$ for a graph to have diameter two. The
extra factor of $\sqrt{\ln n}$ ensures that every one of the $\binom{n}{2}$ pairs of $i$ and $j$ has a common
neighbor. When $p = c\sqrt{\frac{\ln n}{n}}$, for $c < \sqrt{2}$, the graph almost surely has diameter greater
than two and for $c > \sqrt{2}$, the graph almost surely has diameter less than or equal to two.


Theorem 8.5 The property that $G(n,p)$ has diameter two has a sharp threshold at
$p = \sqrt{2}\sqrt{\frac{\ln n}{n}}$.

Proof: If $G$ has diameter greater than two, then there exists a pair of non-adjacent
vertices $i$ and $j$ such that no other vertex of $G$ is adjacent to both $i$ and $j$. This motivates
calling such a pair bad.


Introduce a set of indicator variables $I_{ij}$, one for each pair of vertices $(i,j)$ with $i < j$,
where $I_{ij}$ is 1 if and only if the pair $(i,j)$ is bad. Let

$$x = \sum_{i<j} I_{ij}$$

be the number of bad pairs of vertices. Putting $i < j$ in the sum ensures each pair $(i,j)$
is counted only once. A graph has diameter at most two if and only if it has no bad pair,
i.e., $x = 0$. Thus, if $\lim_{n\to\infty} E(x) = 0$, then for large $n$, almost surely, a graph has no bad
pair and hence has diameter at most two.


The probability that a given vertex is adjacent to both vertices in a pair of vertices
$(i,j)$ is $p^2$. Hence, the probability that the vertex is not adjacent to both vertices is
$1 - p^2$. The probability that no vertex is adjacent to both vertices of the pair $(i,j)$ is $(1-p^2)^{n-2}$ and the
probability that $i$ and $j$ are not adjacent is $1 - p$. Since there are $\binom{n}{2}$ pairs of vertices,
the expected number of bad pairs is

$$E(x) = \binom{n}{2}(1-p)\left(1-p^2\right)^{n-2}.$$

Setting $p = c\sqrt{\frac{\ln n}{n}}$,

$$E(x) \cong \frac{n^2}{2}\left(1 - c\sqrt{\frac{\ln n}{n}}\right)\left(1 - c^2\frac{\ln n}{n}\right)^{n} \cong \frac{n^2}{2}e^{-c^2\ln n} \cong \frac{1}{2}n^{2-c^2}.$$

For $c > \sqrt{2}$, $\lim_{n\to\infty} E(x) = 0$. By the first moment method, for $p = c\sqrt{\frac{\ln n}{n}}$ with $c > \sqrt{2}$,
$G(n,p)$ almost surely has no bad pair and hence has diameter at most two.
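
The notion of a bad pair and the formula for $E(x)$ translate directly into code. The sketch below is only an illustration: `bad_pairs` and `expected_bad_pairs` are names we introduce, and a graph is assumed to be given as a list of neighbor sets.

```python
import itertools
import math

def bad_pairs(adj: list[set[int]]) -> int:
    """Count pairs (i, j), i < j, that are non-adjacent and have no common neighbor.
    The graph has diameter at most two exactly when this count is zero."""
    n = len(adj)
    return sum(1 for i, j in itertools.combinations(range(n), 2)
               if j not in adj[i] and not (adj[i] & adj[j]))

def expected_bad_pairs(n: int, p: float) -> float:
    """E(x) = C(n,2) * (1 - p) * (1 - p^2)^(n-2) for G(n, p)."""
    return math.comb(n, 2) * (1 - p) * (1 - p * p) ** (n - 2)

if __name__ == "__main__":
    path = [{1}, {0, 2}, {1, 3}, {2}]          # path 0-1-2-3: only (0, 3) is bad
    star = [{1, 2, 3}, {0}, {0}, {0}]          # star centered at 0: diameter two
    print(bad_pairs(path), bad_pairs(star))
    n = 10**6
    for c in (1.2, 1.6):                       # on either side of sqrt(2)
        p = c * math.sqrt(math.log(n) / n)
        print(c, expected_bad_pairs(n, p), 0.5 * n ** (2 - c * c))
```

For $c$ below $\sqrt{2}$ the expectation is large and for $c$ above $\sqrt{2}$ it is tiny, matching the $\frac{1}{2}n^{2-c^2}$ estimate.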


Next, consider the case $c < \sqrt{2}$ where $\lim_{n\to\infty} E(x) = \infty$. We appeal to a second moment
argument to claim that almost surely a graph has a bad pair and thus has diameter greater
than two.

$$E(x^2) = E\Big(\sum_{i<j} I_{ij}\Big)^{2} = E\Big(\sum_{i<j} I_{ij}\sum_{k<l} I_{kl}\Big) = E\Big(\sum_{\substack{i<j \\ k<l}} I_{ij}I_{kl}\Big) = \sum_{\substack{i<j \\ k<l}} E(I_{ij}I_{kl}).$$

The summation can be partitioned into three summations depending on the number of
distinct indices among $i$, $j$, $k$, and $l$. Call this number $a$.

$$E(x^2) = \sum_{\substack{i<j,\ k<l \\ a=4}} E(I_{ij}I_{kl}) + \sum_{\substack{\{i,j,k\},\ i<j \\ a=3}} E(I_{ij}I_{ik}) + \sum_{\substack{i<j \\ a=2}} E(I_{ij}^2). \qquad (8.2)$$

Consider the case $a = 4$ where $i$, $j$, $k$, and $l$ are all distinct. If $I_{ij}I_{kl} = 1$, then both
pairs $(i,j)$ and $(k,l)$ are bad and so for each $u$ not in $\{i,j,k,l\}$, at least one of the edges
$(i,u)$ or $(j,u)$ is absent and, in addition, at least one of the edges $(k,u)$ or $(l,u)$ is absent.
The probability of this for one $u$ not in $\{i,j,k,l\}$ is $(1-p^2)^2$. As $u$ ranges over all the
$n-4$ vertices not in $\{i,j,k,l\}$, these events are all independent. Thus,

$$E(I_{ij}I_{kl}) \le (1-p^2)^{2(n-4)} \le \left(1 - c^2\frac{\ln n}{n}\right)^{2n}(1+o(1)) \le n^{-2c^2}(1+o(1))$$

and the first sum is

$$\sum_{\substack{i<j \\ k<l}} E(I_{ij}I_{kl}) \le \frac{1}{4}n^{4-2c^2}(1+o(1)),$$

where the $\frac{1}{4}$ is because only a fourth of the 4-tuples $(i,j,k,l)$ have $i < j$ and $k < l$.


For the second summation, observe that if $I_{ij}I_{ik} = 1$, then for every vertex $u$ not equal
to $i$, $j$, or $k$, either there is no edge between $i$ and $u$ or there is an edge $(i,u)$ and both
edges $(j,u)$ and $(k,u)$ are absent. The probability of this event for one $u$ is

$$1 - p + p(1-p)^2 = 1 - 2p^2 + p^3 \approx 1 - 2p^2.$$

Thus, the probability for all such $u$ is $(1-2p^2)^{n-3}$. Substituting $c\sqrt{\frac{\ln n}{n}}$ for $p$ yields

$$\left(1 - \frac{2c^2\ln n}{n}\right)^{n-3} \cong e^{-2c^2\ln n} = n^{-2c^2},$$

which is an upper bound on $E(I_{ij}I_{kl})$ for one $i$, $j$, $k$, and $l$ with $a = 3$. Summing over all
distinct triples yields $n^{3-2c^2}$ for the second summation in (8.2).


For the third summation, since the value of $I_{ij}$ is zero or one, $E(I_{ij}^2) = E(I_{ij})$. Thus,

$$\sum_{i<j} E(I_{ij}^2) = E(x).$$

Hence, $E(x^2) \le \frac{1}{4}n^{4-2c^2} + n^{3-2c^2} + n^{2-c^2}$ and $E(x) \cong \frac{1}{2}n^{2-c^2}$, from which it follows that
for $c < \sqrt{2}$, $E(x^2) \le E^2(x)(1+o(1))$. By a second moment argument, Corollary 8.4, a
graph almost surely has at least one bad pair of vertices and thus has diameter greater
than two. Therefore, the property that the diameter of $G(n,p)$ is less than or equal to
two has a sharp threshold at $p = \sqrt{2}\sqrt{\frac{\ln n}{n}}$. ■


Disappearance of Isolated Vertices


The disappearance of isolated vertices in $G(n,p)$ has a sharp threshold at $\frac{\ln n}{n}$. At
this point the giant component has absorbed all the small components and with the
disappearance of isolated vertices, the graph becomes connected.


Theorem 8.6 The disappearance of isolated vertices in $G(n,p)$ has a sharp threshold of
$\frac{\ln n}{n}$.


Proof: Let $x$ be the number of isolated vertices in $G(n,p)$. Then,

$$E(x) = n(1-p)^{n-1}.$$

Since we believe the threshold to be $\frac{\ln n}{n}$, consider $p = c\frac{\ln n}{n}$. Then,

$$\lim_{n\to\infty} E(x) = \lim_{n\to\infty} n\left(1 - \frac{c\ln n}{n}\right)^{n} = \lim_{n\to\infty} ne^{-c\ln n} = \lim_{n\to\infty} n^{1-c}.$$

If $c > 1$, the expected number of isolated vertices goes to zero. If $c < 1$, the expected
number of isolated vertices goes to infinity. If the expected number of isolated vertices
goes to zero, it follows that almost all graphs have no isolated vertices. On the other
hand, if the expected number of isolated vertices goes to infinity, a second moment
argument is needed to show that almost all graphs have an isolated vertex and that the
isolated vertices are not concentrated on some vanishingly small set of graphs with almost
all graphs not having isolated vertices.
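
A small numeric illustration of the first moment computation: the sketch below evaluates the exact expectation $n(1-p)^{n-1}$ at $p = c\ln n/n$ and compares it with the asymptotic form $n^{1-c}$. The function name `expected_isolated` and the sample values of $c$ and $n$ are ours.

```python
import math

def expected_isolated(n: int, p: float) -> float:
    """Exact expected number of isolated vertices in G(n, p): n * (1 - p)^(n - 1)."""
    return n * (1 - p) ** (n - 1)

if __name__ == "__main__":
    n = 10**6
    for c in (0.5, 1.0, 1.5):
        p = c * math.log(n) / n
        # The second column should be close to the third, n^(1 - c).
        print(c, expected_isolated(n, p), n ** (1 - c))
```

For $c = 0.5$ the expectation is about $\sqrt{n}$, for $c = 1$ it is about one, and for $c = 1.5$ it is about $1/\sqrt{n}$.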


Assume $c < 1$. Write $x = I_1 + I_2 + \cdots + I_n$, where $I_i$ is the indicator variable indicating
whether vertex $i$ is an isolated vertex. Then $E(x^2) = \sum_{i=1}^{n} E(I_i^2) + 2\sum_{i<j} E(I_iI_j)$. Since $I_i$
equals 0 or 1, $I_i^2 = I_i$ and the first sum has value $E(x)$. Since all elements in the second
sum are equal,

$$E(x^2) = E(x) + n(n-1)E(I_1I_2) = E(x) + n(n-1)(1-p)^{2(n-1)-1}.$$

The minus one in the exponent $2(n-1)-1$ avoids counting the edge from vertex 1 to
vertex 2 twice. Now,


$$\frac{E(x^2)}{E^2(x)} = \frac{n(1-p)^{n-1} + n(n-1)(1-p)^{2(n-1)-1}}{n^2(1-p)^{2(n-1)}} = \frac{1}{n(1-p)^{n-1}} + \left(1 - \frac{1}{n}\right)\frac{1}{1-p}.$$

For $p = c\frac{\ln n}{n}$ with $c < 1$, $\lim_{n\to\infty} E(x) = \infty$ and

$$\lim_{n\to\infty}\frac{E(x^2)}{E^2(x)} = \lim_{n\to\infty}\left[\frac{1}{n^{1-c}} + \left(1 - \frac{1}{n}\right)\frac{1}{1 - c\frac{\ln n}{n}}\right] = 1,$$

that is, $E(x^2) = E^2(x)(1 + o(1))$.

By the second moment argument, Corollary 8.4, the probability that $x = 0$ goes to zero
implying that almost all graphs have an isolated vertex. Thus, $\frac{\ln n}{n}$ is a sharp threshold
for the disappearance of isolated vertices. For $p = c\frac{\ln n}{n}$, when $c > 1$ there almost surely
are no isolated vertices, and when $c < 1$ there almost surely are isolated vertices. ■


Figure 8.7: A degree three vertex with three adjacent degree two vertices. Graph cannot
have a Hamilton circuit.
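
The ratio $E(x^2)/E^2(x)$ used in this proof has a closed form, so its convergence to one can be observed directly. This is a rough sketch under the formulas above; `second_moment_ratio` is a name we made up for the illustration.

```python
import math

def second_moment_ratio(n: int, p: float) -> float:
    """E(x^2) / E^2(x) for the number x of isolated vertices in G(n, p):
    1 / (n (1-p)^(n-1)) + (1 - 1/n) / (1 - p)."""
    return 1 / (n * (1 - p) ** (n - 1)) + (1 - 1 / n) / (1 - p)

if __name__ == "__main__":
    c = 0.5                                   # below the threshold, so E(x) grows
    for n in (10**2, 10**4, 10**6, 10**8):
        p = c * math.log(n) / n
        print(n, second_moment_ratio(n, p))   # approaches 1 as n grows
```

The first term equals $1/E(x)$, which is why the ratio tends to one exactly when the expected number of isolated vertices tends to infinity.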


Hamilton circuits 


So far in establishing phase transitions in the $G(n,p)$ model for an item such as the
disappearance of isolated vertices, we introduced a random variable $x$ that was the number
of occurrences of the item. We then determined the probability $p$ for which the expected
value of $x$ went from zero to infinity. For values of $p$ for which $E(x) \to 0$, we argued that
with high probability, a graph generated at random had no occurrences of the item. For values of
$p$ for which $E(x) \to \infty$, we used the second moment argument to conclude that with high
probability, a graph generated at random had occurrences of the item. That is, the occurrences
that forced $E(x)$ to infinity were not all concentrated on a vanishingly small fraction of
the graphs. One might raise the question for the $G(n,p)$ graph model, do there exist
items that are so concentrated on a small fraction of the graphs that the value of $p$ where
$E(x)$ goes from zero to infinity is not the threshold? An example where this happens is
Hamilton circuits.


A Hamilton circuit is a simple cycle that includes all the vertices. For example, in a 
graph of 4 vertices, there are three possible Hamilton circuits: (1,2,3,4), (1,2,4,3), and 
(1,3,2,4). Note that our graphs are undirected, so the circuit (1,2,3,4) is the same as 
the circuit (1,4, 3, 2). 
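
The count of three circuits on four vertices can be reproduced by brute force. The sketch below enumerates the Hamilton circuits of the complete graph on vertices $1, \ldots, n$, treating a circuit and its reversal as the same; the function name and the canonical form (start at vertex 1, smaller neighbor first) are our own conventions.

```python
import itertools
import math

def hamilton_circuits(n: int) -> set[tuple[int, ...]]:
    """All Hamilton circuits of the complete graph on vertices 1..n, each written
    starting at vertex 1 and listing the smaller neighbor of 1 first."""
    circuits = set()
    for perm in itertools.permutations(range(2, n + 1)):
        if perm[0] < perm[-1]:          # keep one of each circuit/reversal pair
            circuits.add((1,) + perm)
    return circuits

if __name__ == "__main__":
    print(sorted(hamilton_circuits(4)))
    for n in range(3, 8):
        print(n, len(hamilton_circuits(n)), math.factorial(n - 1) // 2)
```

For $n = 4$ this yields exactly the three circuits listed above, and in general the count is $(n-1)!/2$.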


Let $x$ be the number of Hamilton circuits in $G(n,p)$ and let $p = \frac{d}{n}$ for some constant
$d$. There are $\frac{1}{2}(n-1)!$ potential Hamilton circuits in a graph and each has probability
$\left(\frac{d}{n}\right)^{n}$ of actually being a Hamilton circuit. Thus,

$$E(x) = \frac{1}{2}(n-1)!\left(\frac{d}{n}\right)^{n} \simeq \left(\frac{n}{e}\right)^{n}\left(\frac{d}{n}\right)^{n} \to \begin{cases} 0 & d < e \\ \infty & d > e \end{cases}$$

This suggests that the threshold for Hamilton circuits occurs when $d$ equals Euler's
number $e$. This is not possible since the graph still has isolated vertices and is not even
connected for $p = \frac{e}{n}$. Thus, the second moment argument is indeed necessary.
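
Because $(n-1)!$ overflows floating point quickly, it is convenient to look at $\ln E(x)$ instead. The sketch below uses the standard library's `math.lgamma`, with $\ln\Gamma(n) = \ln (n-1)!$; the function name `log_expected_hamilton` is hypothetical.

```python
import math

def log_expected_hamilton(n: int, d: float) -> float:
    """Natural log of E(x) = (1/2) (n-1)! (d/n)^n, computed via lgamma to avoid overflow."""
    return math.lgamma(n) - math.log(2) + n * (math.log(d) - math.log(n))

if __name__ == "__main__":
    for d in (2.5, 3.0):                 # just below and just above e = 2.718...
        for n in (100, 1000, 10000):
            print(d, n, log_expected_hamilton(n, d))
```

For $d < e$ the logarithm decreases roughly linearly in $n$, and for $d > e$ it increases, which is the dichotomy in the displayed limit.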


The actual threshold for Hamilton circuits is $\frac{1}{n}\log n$. For any $p(n)$ asymptotically
greater, $G(n,p)$ will have a Hamilton circuit with probability tending to one. This is the same
threshold as for the disappearance of degree one vertices. Clearly a graph with a degree
one vertex cannot have a Hamilton circuit. But it may seem surprising that Hamilton
circuits appear as soon as degree one vertices disappear. You may ask why at the
moment degree one vertices disappear there cannot be a subgraph consisting of a degree
three vertex adjacent to three degree two vertices as shown in Figure 8.7. The reason is
that the frequency of degree two and three vertices in the graph is very small and the
probability that four such vertices would occur together in such a subgraph is too small
for it to happen with non-negligible probability.


Explanation of component sizes 


In $G(n, \frac{d}{n})$ with $d < 1$, there are only small components of size at most $\ln n$. With
$d > 1$, there is a giant component and small components of size at most $\ln n$ until the
graph becomes fully connected at $p = \frac{\ln n}{n}$. There never are components of size between
$\ln n$ and $\Omega(n)$ except during the phase transition at $p = \frac{1}{n}$. To understand why there are
no intermediate size components in $G(n, \frac{d}{n})$ with $d > 1$ consider a breadth first search
(bfs). Discovered but unexplored vertices are called the frontier. When the frontier
becomes empty the component has been found. The expected size of the frontier initially
grows by about $d - 1$ per step, then slows down and eventually decreases until it becomes zero where the
number of discovered vertices is a constant fraction of $n$. The actual size of the frontier is
a random variable. Initially when the expected size of the frontier is small, the actual size
can differ substantially from the expectation, and reach zero resulting in a small $O(\ln n)$
component. After $\ln n$ steps the expected size of the frontier is sufficiently large that the
actual size cannot differ sufficiently to be zero and thus no component can be found until
the expected size of the frontier is again close to zero, which occurs when the number of
discovered vertices is a constant fraction of $n$. At this point the actual size of the frontier
can differ enough from its expectation so it can be zero resulting in a giant component.


Figure 8.8: The solid curve is the expected size of the frontier. The two dashed curves
indicate the high-probability range of possible values for the actual size of the frontier.


Expected size of frontier 


To compute the size of a connected component of $G(n, \frac{d}{n})$, do a breadth first search
of a component starting from an arbitrary vertex and generate an edge only when the
search process needs to know if the edge exists. Let us define a step in this process as
the full exploration of one vertex. To aid in our analysis, let us imagine that whenever
the bfs finishes, we create a brand-new $(n+1)$st vertex (call it a “red vertex”) that is
connected to each real vertex with probability $\frac{d}{n}$ and we then continue the bfs from there.
The red vertex becomes “explored” but it was never “discovered”. This modified process
has the useful property that for each real (non-red) vertex $u$ other than the start vertex,
the probability that $u$ is undiscovered after the first $i$ steps is exactly $\left(1 - \frac{d}{n}\right)^{i}$. Define the
size of the frontier to be the number of discovered vertices minus the number of explored
vertices; this equals the number of vertices in the bfs frontier in the true bfs process, but
it can now go negative once the true bfs completes and we start creating red vertices. The
key point to keep in mind is that the true bfs must have completed by the time the size
of the frontier reaches zero. For large $n$,


$$1 - \left(1 - \frac{d}{n}\right)^{i} \cong 1 - e^{-\frac{d}{n}i}$$

and the expected size of the frontier after $i$ steps is $n\left(1 - e^{-\frac{d}{n}i}\right) - i$. Normalize the size of
the frontier by dividing by $n$ to get $1 - e^{-\frac{d}{n}i} - \frac{i}{n}$. Let $x = \frac{i}{n}$ be the normalized number of
steps and let $f(x) = 1 - e^{-dx} - x$ be the normalized expected size of the frontier. When
$d > 1$, $f(0) = 0$ and $f'(0) = d - 1 > 0$, so $f$ is increasing at 0. But $f(1) = -e^{-d} < 0$. So,
for some value $\theta$, $0 < \theta < 1$, $f(\theta) = 0$. When $d = 2$, $\theta = 0.7968$.
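
The root $\theta$ has no closed form, but it is easy to locate numerically. The following sketch uses plain bisection; the function name `frontier_root` and the choice of bracket are ours. Since $f$ is concave with its maximum at $x = \ln d/d$, the interval $[\ln d/d,\, 1]$ contains exactly one sign change.

```python
import math

def frontier_root(d: float, tol: float = 1e-12) -> float:
    """Positive root theta of f(x) = 1 - exp(-d*x) - x for d > 1, found by bisection."""
    if d <= 1:
        raise ValueError("a positive root exists only for d > 1")

    def f(x: float) -> float:
        return 1.0 - math.exp(-d * x) - x

    lo, hi = math.log(d) / d, 1.0     # f(lo) > 0 at the maximizer of f, f(hi) < 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

if __name__ == "__main__":
    for d in (1.1, 1.5, 2.0, 3.0):
        print(d, round(frontier_root(d), 4))
```

For $d = 2$ this reproduces the value $0.7968$ quoted above; $\theta$ shrinks toward zero as $d$ approaches one from above.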


Difference of actual size of frontier from expected size


For $d > 1$, the expected size of the frontier grows as $(d-1)i$ for small $i$. The actual size
of the frontier is a random variable. What is the probability that the actual size of the
frontier will differ from the expected size of the frontier by a sufficient amount so that the
actual size of the frontier is zero? To answer this, we need to understand the distribution
of the number of discovered vertices after $i$ steps. For small $i$, the probability that a vertex
has been discovered is $1 - (1 - d/n)^{i} \approx id/n$ and the binomial distribution for the number
of discovered vertices, binomial$(n, \frac{id}{n})$, is well approximated by the Poisson distribution
with the same mean $id$. The probability that a total of $k$ vertices have been discovered in
$i$ steps is approximately $e^{-di}\frac{(di)^k}{k!}$. For a connected component to have exactly $i$ vertices,
the frontier must drop to zero for the first time at step $i$. A necessary condition is that
exactly $i$ vertices must have been discovered in the first $i$ steps. Using Stirling's
approximation $i! \cong i^i/e^i$, the probability of this is approximately

$$e^{-di}\frac{(di)^{i}}{i!} = e^{-di}\frac{d^{i}i^{i}}{i^{i}}e^{i} = e^{-(d-1)i}d^{i} = e^{-(d-1-\ln d)i}.$$

For $d \ne 1$, $d - 1 - \ln d > 0$ (see footnote 36). Thus the probability $e^{-(d-1-\ln d)i}$ drops off exponentially
with $i$. For $i > c\ln n$ and sufficiently large $c$, the probability that the breadth first search
starting from a particular vertex terminates with a component of size $i$ is $o(1/n)$ as long
as the Poisson approximation is valid. In the range of this approximation, the probability
that a breadth first search started from any vertex terminates with $i > c\ln n$ vertices is
$o(1)$. Intuitively, if the component has not stopped growing within $\Omega(\ln n)$ steps, it is
likely to continue to grow until it becomes much larger and the expected value of the size
of the frontier again becomes small. While the expected value of the frontier is large, the
probability that the actual size will differ from the expected size sufficiently for the actual
size of the frontier to be zero is vanishingly small.
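
To see how large the constant $c$ needs to be, note that $e^{-(d-1-\ln d)i} \le n^{-2}$ as soon as $i \ge 2\ln n/(d-1-\ln d)$. The sketch below computes this cutoff; the function names and the target $n^{-2}$ (any bound that is $o(1/n)$ would do) are our assumptions, not part of the argument above.

```python
import math

def decay_rate(d: float) -> float:
    """Exponent d - 1 - ln d; zero at d = 1 and positive for every other d > 0."""
    return d - 1 - math.log(d)

def size_cutoff(n: int, d: float, power: float = 2.0) -> float:
    """Smallest i for which exp(-decay_rate(d) * i) <= n ** (-power)."""
    return power * math.log(n) / decay_rate(d)

if __name__ == "__main__":
    n = 10**6
    for d in (0.5, 1.5, 2.0, 4.0):
        print(d, round(decay_rate(d), 4), round(size_cutoff(n, d), 1))
```

The cutoff is a constant multiple of $\ln n$, and the constant blows up as $d$ approaches one, where the decay rate vanishes.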





For $i$ near $n\theta$ the absolute value of the expected size of the frontier increases linearly
with $|i - n\theta|$. Thus for the actual size of the frontier to be zero, the frontier size must
deviate from its expected value by an amount proportional to $|i - n\theta|$. For values of $i$
near $n\theta$, the binomial distribution can be approximated by a Gaussian distribution. The
Gaussian falls off exponentially fast with the square of the distance from its mean. The
distribution falls off proportional to $e^{-\frac{k^2}{\sigma^2}}$ where $k$ is the distance from the mean and $\sigma^2$ is the variance, which is proportional to
$n$. Thus to have a non-vanishing probability, $k$ must be at most $\sqrt{n}$. This implies that the
giant component is in the range $[n\theta - \sqrt{n}, n\theta + \sqrt{n}]$. Thus a component is either small or
in the range $[n\theta - \sqrt{n}, n\theta + \sqrt{n}]$.


Footnote 36: Let $f(d) = d - 1 - \ln d$. Then $\frac{\partial f}{\partial d} = 1 - \frac{1}{d}$ and $\frac{\partial f}{\partial d} < 0$ for $d < 1$ and $\frac{\partial f}{\partial d} > 0$ for $d > 1$. Now $f(d) = 0$
at $d = 1$ and is positive for all other $d > 0$.


Figure 8.9: Picture after $\varepsilon n^2/2$ edge queries. The potential edges from the small
connected components to unvisited vertices have all been explored and do not exist in the
graph. However, since many edges must have been found the frontier must be big and
hence there is a giant component.


8.3 Giant Component 


Consider $G(n,p)$ for $p = \frac{1+\varepsilon}{n}$ where $\varepsilon$ is a constant greater than zero. We now show that
with high probability, such a graph contains a giant component, namely a component of
size $\Omega(n)$. Moreover, with high probability, the graph contains only one such component,
and all other components are much smaller, of size only $O(\log n)$. We begin by arguing
existence of a giant component.


8.3.1 Existence of a Giant Component 


To see that with high probability the graph has a giant component, do a depth first search
(dfs) on $G(n, p)$ where $p = (1+\epsilon)/n$ with $0 < \epsilon < 1/8$. Note that it suffices to consider
this range of $\epsilon$ since increasing the value of $p$ only increases the probability that the graph
has a giant component.

To perform the dfs, generate $\binom{n}{2}$ independent Bernoulli($p$) random bits and answer
the $t$-th edge query according to the $t$-th bit. As the dfs proceeds, let


E = set of fully explored vertices whose exploration is complete 
U = set of unvisited vertices 


F = frontier of visited and still being explored vertices . 


Initially the set of fully explored vertices E and the frontier F are empty and the 
set of unvisited vertices, U equals {1,2,...,n}. If the frontier is not empty and u is the 
active vertex of the dfs, the dfs queries each unvisited vertex in U until it finds a vertex 
v for which there is an edge (u, v) and moves v from U to the frontier and v becomes the 
active vertex. If no edge is found from u to an unvisited vertex in U, then u is moved from 
the frontier to the set of fully explored vertices E. If the frontier is empty, the dfs moves an
unvisited vertex from U to the frontier and starts a new component. If both the frontier
and U are empty, all connected components of G have been found. At any time all edges
between the current fully explored vertices, E, and the current unvisited vertices, U, have 
been queried since a vertex is moved from the frontier to E only when there is no edge 
from the vertex to an unexplored vertex in U. 
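
The dfs with lazily answered edge queries can be sketched in code. The following is a minimal illustration, not part of the original text: the function name `dfs_on_gnp`, the `max_queries` cutoff, and the per-vertex scan pointer `next_candidate` are assumptions introduced here. Each potential edge is queried at most once, and each query is answered by an independent Bernoulli($p$) bit, as in the description above.

```python
import random

def dfs_on_gnp(n, p, max_queries, seed=0):
    """Run the dfs on G(n, p), stopping after max_queries edge queries.

    Returns (|E|, |F|, |U|, number of queries made).
    """
    rng = random.Random(seed)
    unvisited = set(range(n))   # U: unvisited vertices
    frontier = []               # F: a stack; the top is the active vertex
    explored = set()            # E: fully explored vertices
    next_candidate = [0] * n    # scan position of each vertex over 0..n-1
    queries = 0
    while queries < max_queries and (frontier or unvisited):
        if not frontier:
            # Frontier is empty: start a new component.
            frontier.append(unvisited.pop())
            continue
        u = frontier[-1]
        found = False
        while next_candidate[u] < n and queries < max_queries:
            w = next_candidate[u]
            next_candidate[u] += 1
            if w in unvisited:
                queries += 1
                if rng.random() < p:      # edge query answered yes
                    unvisited.remove(w)
                    frontier.append(w)    # w becomes the active vertex
                    found = True
                    break
        if not found and next_candidate[u] >= n:
            # No edge from u to any unvisited vertex: u is fully explored.
            explored.add(frontier.pop())
    return len(explored), len(frontier), len(unvisited), queries

if __name__ == "__main__":
    n, eps = 2000, 0.1
    print(dfs_on_gnp(n, (1 + eps) / n, int(eps * n * n / 2)))
```

Because vertices only ever leave U, a vertex skipped by the scan pointer never needs to be revisited, which is why a single forward scan per vertex suffices in this sketch.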


Intuitively, after $\epsilon n^2/2$ edge queries a large number of edges must have been found since
$p = \frac{1+\epsilon}{n}$. None of these can connect components already found with the set of unvisited
vertices, and we will use this to show that with high probability the frontier must be large.
Since the frontier will be in a connected component, a giant component exists with high
probability. We first prove that after $\epsilon n^2/2$ edge queries the set of fully explored vertices
is of size less than $n/3$.


Lemma 8.7 After $\epsilon n^2/2$ edge queries, with high probability $|E| < n/3$.


Proof: If not, at some $t \le \epsilon n^2/2$, $|E| = n/3$. A vertex is added to the frontier only when
an edge query is answered yes. So at time $t$, $|F|$ is less than or equal to the sum of $\epsilon n^2/2$
Bernoulli($p$) random variables, which with high probability is at most $\epsilon n^2 p < n/3$. So,
at $t$, $|U| = n - |E| - |F| > n/3$. Since there are no edges between fully explored vertices
and unvisited vertices, $|E|\,|U| > n^2/9$ edge queries must have already been answered in
the negative. But $t > n^2/9$ contradicts $t \le \epsilon n^2/2 < n^2/16$. Thus $|E| < n/3$. ∎


The frontier vertices in the search of a connected component are all in the component 
being searched. Thus if at any time the frontier set has $\Omega(n)$ vertices there is a giant
component. 


Lemma 8.8 After $\epsilon n^2/2$ edge queries, with high probability the frontier F consists of at
least $\epsilon^2 n/30$ vertices.


Proof: After $\epsilon n^2/2$ queries, say, $|F| < \epsilon^2 n/30$. Then

$$|U| = n - |E| - |F| \ge n - \frac{n}{3} - \frac{\epsilon^2 n}{30} \ge 1$$

and so the dfs is still active. Each positive answer to an edge query so far resulted in some
vertex moving from U to F, which possibly later moved to E. The expected number of
yes answers so far is $p\epsilon n^2/2 = (1+\epsilon)\epsilon n/2$ and with high probability, the number of yes
answers is at least $(\epsilon n/2) + (\epsilon^2 n/3)$. So,

$$|E| + |F| \ge \frac{\epsilon n}{2} + \frac{\epsilon^2 n}{3} \implies |E| \ge \frac{\epsilon n}{2} + \frac{3\epsilon^2 n}{10}.$$

We must have $|E|\,|U| \le \epsilon n^2/2$. Now, $|E|\,|U| = |E|(n - |E| - |F|)$ increases as $|E|$ increases
from $\frac{\epsilon n}{2} + \frac{3\epsilon^2 n}{10}$ to $n/3$, so we have

$$|E|\,|U| \ge \left(\frac{\epsilon n}{2} + \frac{3\epsilon^2 n}{10}\right)\left(n - \frac{\epsilon n}{2} - \frac{3\epsilon^2 n}{10} - \frac{\epsilon^2 n}{30}\right) > \frac{\epsilon n^2}{2},$$

a contradiction. ∎


8.3.2 No Other Large Components


We now argue that for $p = (1+\epsilon)/n$ for constant $\epsilon > 0$, with high probability there is
only one giant component, and in fact all other components have size $O(\log n)$.


We begin with a preliminary observation. Suppose that a $G(n, p)$ graph had at least
a $\delta$ probability of having two (or more) components of size $\omega(\log n)$, i.e., asymptotically
greater than $\log n$. Then, there would be at least a $\delta/2$ probability of the graph having
two (or more) components with $\omega(\log n)$ vertices inside the subset $A = \{1, 2, \ldots, \epsilon n/2\}$.
The reason is that an equivalent way to construct a graph $G(n, p)$ is to first create it in the
usual way and then to randomly permute the vertices. Any component of size $\omega(\log n)$
will with high probability after permutation have at least an $\epsilon/4$ fraction of its vertices
within the first $\epsilon n/2$. Thus, it suffices to prove that with high probability at most one
component has $\omega(\log n)$ vertices within the set $A$ to conclude that with high probability
the graph has only one component with $\omega(\log n)$ vertices overall.


We now prove that with high probability, a $G(n, p)$ graph for $p = (1+\epsilon)/n$ has at
most one component with $\omega(\log n)$ vertices inside the set $A$. To do so, let $B$ be the set
of $(1 - \epsilon/2)n$ vertices not in $A$. Now, construct the graph as follows. First, randomly
flip coins of bias $p$ to generate the edges within set $A$ and the edges within set $B$. At
this point, with high probability, $B$ has at least one giant component, by the argument
from Section 8.3.1, since $p = (1+\epsilon)/n \ge (1 + \epsilon/4)/|B|$ for $0 < \epsilon < 1/2$. Let $C^*$ be
a giant component inside $B$. Now, flip coins of bias $p$ to generate the edges between $A$
and $B$ except for those incident to $C^*$. At this point, let us name all components with
$\omega(\log n)$ vertices inside $A$ as $C_1, C_2, C_3, \ldots$. Finally, flip coins of bias $p$ to generate the
edges between $A$ and $C^*$.


In the final step above, notice that with high probability, each $C_i$ is connected to $C^*$.
In particular, there are $\omega(n \log n)$ possible edges between any given $C_i$ and $C^*$, each one
of which is present with probability $p$. Thus the probability that this particular $C_i$ is not
connected to $C^*$ is at most $(1 - p)^{\omega(n \log n)} = 1/n^{\omega(1)}$. Thus, by the union bound, with
high probability all such $C_i$ are connected to $C^*$, and there is only one component with
$\omega(\log n)$ vertices within $A$ as desired.


8.3.3 The Case of p < 1/n


When $p < 1/n$, then with high probability all components in $G(n, p)$ are of size $O(\log n)$.
This is easiest to see by considering a variation on the above dfs that (a) begins with
$F$ containing a specific start vertex $u_{start}$, and then (b) when a vertex $u$ is taken from
$F$ to explore, it pops $u$ off of $F$, explores $u$ fully by querying to find all edges between
$u$ and $U$, and then pushes the endpoints $v$ of those edges onto $F$. Thus, this is like an
explicit-stack version of dfs, compared to the previous recursive-call version of dfs. Let us
call the exploration of such a vertex $u$ a step. To make this process easier to analyze, let
us say that if $F$ ever becomes empty, we create a brand-new, fake "red vertex", connect it
to each vertex in $U$ with probability $p$, place the new red vertex into $F$, and then continue
the dfs from there.


Let $z_k$ denote the number of real (non-red) vertices discovered after $k$ steps, not
including $u_{start}$. For any given real vertex $u \ne u_{start}$, the probability that $u$ is not discovered
in $k$ steps is $(1 - p)^k$, and notice that these events are independent over the different
vertices $u \ne u_{start}$. Therefore, the distribution of $z_k$ is Binomial$(n - 1,\, 1 - (1 - p)^k)$. Note
that if $z_k < k$ then the process must have required creating a fake red vertex by step $k$,
meaning that $u_{start}$ is in a component of size at most $k$. Thus, it suffices to prove that
Prob$(z_k \ge k) < 1/n^2$, for $k = c \ln n$ for a suitably large constant $c$, to then conclude by
union bound over choices of $u_{start}$ that with high probability all vertices are in components
of size at most $c \ln n$.


To prove that Prob$(z_k \ge k) < 1/n^2$ for $k = c \ln n$, we use the fact that $(1 - p)^k \ge 1 - pk$
so $1 - (1 - p)^k \le pk$. So, the probability that $z_k$ is greater than or equal to $k$ is at most
the probability that a coin of bias $pk$ flipped $n - 1$ times will have at least $k$ heads. But
since $pk(n - 1) \le (1 - \epsilon)k$ for some constant $\epsilon > 0$, by Chernoff bounds this probability
is at most $e^{-c_0 k}$ for some constant $c_0 > 0$. When $k = c \ln n$ for a suitably large constant
$c$, this probability is at most $1/n^2$, as desired.
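
As an empirical illustration of the claim, the explicit-stack exploration can be run over the whole graph to list all component sizes. This is a sketch under stated assumptions: the function name `component_sizes` is introduced here, the red-vertex device (which is only an analysis aid) is omitted, and a fresh component is simply started whenever the stack empties.

```python
import math
import random

def component_sizes(n, p, seed=0):
    """Component sizes of a G(n, p) sample, generated lazily during exploration."""
    rng = random.Random(seed)
    unvisited = set(range(n))
    sizes = []
    while unvisited:
        stack = [unvisited.pop()]     # start vertex of a new component
        size = 1
        while stack:
            stack.pop()               # one "step": explore this vertex fully
            # Query every edge from the popped vertex to the unvisited set U.
            neighbors = [w for w in unvisited if rng.random() < p]
            for w in neighbors:
                unvisited.remove(w)
                stack.append(w)
            size += len(neighbors)
        sizes.append(size)
    return sizes

if __name__ == "__main__":
    n = 1000
    sizes = component_sizes(n, 0.5 / n)
    print("largest component:", max(sizes), " ln n =", round(math.log(n), 2))
```

Each pair of vertices is queried at most once, since a vertex is explored only once and only against vertices still in U, so the sizes returned have the same distribution as the component sizes of $G(n, p)$.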


8.4 Cycles and Full Connectivity 


This section considers when cycles form and when the graph becomes fully connected. 
For both of these problems, we look at each subset of k vertices and see when they form 
either a cycle or when they form a connected component. 


8.4.1 Emergence of Cycles 


The emergence of cycles in $G(n,p)$ has a threshold at $p = 1/n$. However,
the threshold is not sharp.


Theorem 8.9 The threshold for the existence of cycles in G(n,p) is p = 1/n. 


Proof: Let $x$ be the number of cycles in $G(n, p)$. To form a cycle of length $k$, the vertices
can be selected in $\binom{n}{k}$ ways. Given the $k$ vertices of the cycle, they can be ordered by
arbitrarily selecting a first vertex, then a second vertex in one of $k-1$ ways, a third in one
of $k-2$ ways, etc. Since a cycle and its reversal are the same cycle, divide by 2. Thus,
there are $\binom{n}{k}\frac{(k-1)!}{2}$ possible cycles of length $k$ and

$$E(x) = \sum_{k=3}^{n} \binom{n}{k}\frac{(k-1)!}{2}p^k \le \sum_{k=3}^{n} \frac{n^k}{2k}p^k \le \sum_{k=3}^{n} (np)^k = (np)^3\,\frac{1 - (np)^{n-2}}{1 - np} \le 2(np)^3,$$

provided that $np < 1/2$. When $p$ is asymptotically less than $1/n$, then $\lim_{n\to\infty} np = 0$ and
$\lim_{n\to\infty} \sum_{k=3}^{n} (np)^k = 0$. So, as $n$ goes to infinity, $E(x)$ goes to zero. Thus, the graph almost
surely has no cycles by the first moment method. A second moment argument can be
used to show that for $p = d/n$, $d > 1$, a graph will have a cycle with probability tending
to one. ∎


The argument above does not yield a sharp threshold since we argued that $E(x) \to 0$
only under the assumption that $p$ is asymptotically less than $\frac{1}{n}$. A sharp threshold requires
$E(x) \to 0$ for $p = d/n$, $d < 1$.


Consider what happens in more detail when p = d/n, d a constant. 


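$$E(x) = \sum_{k=3}^{n} \binom{n}{k}\frac{(k-1)!}{2}p^k = \frac{1}{2}\sum_{k=3}^{n} \frac{n(n-1)\cdots(n-k+1)}{k!}(k-1)!\,p^k = \frac{1}{2}\sum_{k=3}^{n} \frac{n(n-1)\cdots(n-k+1)}{n^k}\,\frac{d^k}{k}.$$

$E(x)$ converges if $d < 1$, and diverges if $d \ge 1$. If $d < 1$, $E(x) \le \frac{1}{2}\sum_{k=3}^{n} \frac{d^k}{k}$ and $\lim_{n\to\infty} E(x)$
equals a constant greater than zero. If $d = 1$, $E(x) = \frac{1}{2}\sum_{k=3}^{n} \frac{n(n-1)\cdots(n-k+1)}{n^k}\,\frac{1}{k}$. Consider
only the first $\log n$ terms of the sum. Since $\frac{n}{n-i} = 1 + \frac{i}{n-i} \le e^{i/(n-i)}$, it follows that
$\frac{n(n-1)\cdots(n-k+1)}{n^k} \ge 1/2$. Thus,

$$E(x) \ge \frac{1}{2}\sum_{k=3}^{\log n} \frac{n(n-1)\cdots(n-k+1)}{n^k}\,\frac{1}{k} \ge \frac{1}{4}\sum_{k=3}^{\log n} \frac{1}{k}.$$

Then, in the limit as $n$ goes to infinity

$$\lim_{n\to\infty} E(x) \ge \lim_{n\to\infty} \frac{1}{4}\sum_{k=3}^{\log n} \frac{1}{k} = \lim_{n\to\infty} \Omega(\log\log n) = \infty.$$

For $p = d/n$, $d < 1$, $E(x)$ converges to a non-zero constant. For $d > 1$, $E(x)$ diverges
to infinity and a second moment argument shows that graphs will have an unbounded
number of cycles increasing with $n$.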
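
The behaviour of $E(x)$ for $p = d/n$ can be checked numerically. The sketch below evaluates the sum $\frac{1}{2}\sum_{k=3}^{n} \frac{n(n-1)\cdots(n-k+1)}{n^k}\frac{d^k}{k}$ term by term; the function name `expected_cycles` is an assumption introduced here, and the running product is used only to avoid huge intermediate factorials.

```python
def expected_cycles(n, d):
    """Expected number of cycles in G(n, d/n).

    Uses E(x) = (1/2) * sum_{k=3}^{n} [n(n-1)...(n-k+1) / n^k] * d^k / k.
    """
    total = 0.0
    term = 1.0  # running value of [n(n-1)...(n-k+1) / n^k] * d^k
    for k in range(1, n + 1):
        term *= d * (n - k + 1) / n
        if k >= 3:
            total += 0.5 * term / k
    return total

if __name__ == "__main__":
    for d in (0.5, 0.9, 1.0):
        print(d, [round(expected_cycles(n, d), 4) for n in (100, 1000, 10000)])
```

For $d < 1$ the printed values should settle near a constant as $n$ grows, while for $d = 1$ they should keep growing slowly, in line with the discussion above.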


8.4.2 Full Connectivity 


As $p$ increases from $p = 0$, small components form. At $p = 1/n$ a giant component
emerges and swallows up smaller components, starting with the larger components and
ending up swallowing isolated vertices, forming a single connected component at $p = \frac{\ln n}{n}$,
at which point the graph becomes connected. We begin our development with a technical
lemma.


Property                                            Threshold
cycles                                              $1/n$
giant component                                     $1/n$
giant component + isolated vertices                 $\frac{1}{2}\frac{\ln n}{n}$
connectivity, disappearance of isolated vertices    $\frac{\ln n}{n}$
diameter two                                        $\sqrt{\frac{2\ln n}{n}}$

Table 2: Thresholds for various properties


Lemma 8.10 The expected number of connected components of size $k$ in $G(n,p)$ is at
most

$$\binom{n}{k} k^{k-2} p^{k-1} (1 - p)^{kn - k^2}.$$


Proof: The probability that $k$ vertices form a connected component consists of the product
of two probabilities. The first is the probability that the $k$ vertices are connected,
and the second is the probability that there are no edges out of the component to the
remainder of the graph. The first probability is at most the sum over all spanning trees
of the $k$ vertices, that the edges of the spanning tree are present. The "at most" in the
lemma statement is because $G(n, p)$ may contain more than one spanning tree on these
nodes and, in this case, the union bound is higher than the actual probability. There are
$k^{k-2}$ spanning trees on $k$ nodes. See Section ?? in the appendix. The probability of all the
$k - 1$ edges of one spanning tree being present is $p^{k-1}$ and the probability that there are
no edges connecting the $k$ vertices to the remainder of the graph is $(1 - p)^{k(n-k)}$. Thus,
the probability of one particular set of $k$ vertices forming a connected component is at
most $k^{k-2} p^{k-1} (1 - p)^{kn - k^2}$. Thus, the expected number of connected components of size
$k$ is at most $\binom{n}{k} k^{k-2} p^{k-1} (1 - p)^{kn - k^2}$. ∎
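
The bound of Lemma 8.10 is easy to evaluate numerically if the product is accumulated in logarithms. The following is a small sketch, assuming $0 < p < 1$; the function name `component_count_bound` is introduced here for illustration.

```python
from math import comb, exp, log

def component_count_bound(n, k, p):
    """Upper bound C(n,k) * k^(k-2) * p^(k-1) * (1-p)^(kn - k^2) from Lemma 8.10.

    Assumes 0 < p < 1 and 1 <= k <= n. Computed via logarithms to avoid overflow.
    """
    log_bound = (log(comb(n, k))
                 + (k - 2) * log(k)
                 + (k - 1) * log(p)
                 + (k * n - k * k) * log(1.0 - p))
    return exp(log_bound)

if __name__ == "__main__":
    n = 1000
    p = 0.75 * log(n) / n
    for k in (2, 3, 5, 10, 100, n // 2):
        print(k, component_count_bound(n, k, p))
```

For tiny cases the bound can be compared with exact values: with $n = 2$, $k = 1$ it gives $2(1-p)$, the exact expected number of isolated vertices, and with $n = 3$, $k = 3$, $p = 1/2$ it gives $0.75$, above the exact connectivity probability $0.5$ because the three spanning trees are over-counted by the union bound.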
We now prove that for $p = \frac{1}{2}\frac{\ln n}{n}$, the giant component has absorbed all small
components except for isolated vertices.


Theorem 8.11 For $p = c\frac{\ln n}{n}$ with $c > 1/2$, almost surely there are only isolated vertices
and a giant component. For $c > 1$, almost surely the graph is connected.


Proof: We prove that almost surely for $c > 1/2$, there is no connected component with
$k$ vertices for any $k$, $2 \le k \le n/2$. This proves the first statement of the theorem since, if
there were two or more components that are not isolated vertices, both of them could not
be of size greater than $n/2$. The second statement that for $c > 1$ the graph is connected
then follows from Theorem 8.6 which states that isolated vertices disappear at $c = 1$.

We now show that for $p = c\frac{\ln n}{n}$, the expected number of components of size $k$,
$2 \le k \le n/2$, is less than $n^{1-2c}$ and thus for $c > 1/2$ there are no components, except
for isolated vertices and the giant component. Let $x_k$ be the number of connected
components of size $k$. Substitute $p = c\frac{\ln n}{n}$ into $\binom{n}{k} k^{k-2} p^{k-1} (1 - p)^{kn - k^2}$ and simplify using
$\binom{n}{k} \le (en/k)^k$, $1 - p \le e^{-p}$, $k - 1 < k$, and $x = e^{\ln x}$ to get

$$E(x_k) \le \exp\left(\ln n + k + k \ln\ln n - 2\ln k + k \ln c - ck \ln n + ck^2 \frac{\ln n}{n}\right).$$

Keep in mind that the leading terms here for large $k$ are the last two and, in fact, at $k = n$,
they cancel each other so that our argument does not prove the fallacious statement for
$c > 1$ that there is no connected component of size $n$, since there is. Let

$$f(k) = \ln n + k + k \ln\ln n - 2\ln k + k \ln c - ck \ln n + ck^2 \frac{\ln n}{n}.$$

Differentiating with respect to $k$,

$$f'(k) = 1 + \ln\ln n - \frac{2}{k} + \ln c - c \ln n + \frac{2ck \ln n}{n}$$

and

$$f''(k) = \frac{2}{k^2} + \frac{2c \ln n}{n} > 0.$$

Thus, the function $f(k)$ attains its maximum over the range $[2, n/2]$ at one of the extreme
points 2 or $n/2$. At $k = 2$, $f(2) \approx (1 - 2c)\ln n$ and at $k = n/2$, $f(n/2) \approx -c\frac{n}{4}\ln n$. So
$f(k)$ is maximum at $k = 2$. For $k = 2$, $E(x_k) = e^{f(k)}$ is approximately $e^{(1-2c)\ln n} = n^{1-2c}$
and is geometrically falling as $k$ increases from 2. At some point $E(x_k)$ starts to increase
but never gets above $n^{-\frac{c}{4}n}$. Thus, the expected sum of the number of components of size
$k$, for $2 \le k \le n/2$, is

$$E\left(\sum_{k=2}^{n/2} x_k\right) = O(n^{1-2c}).$$

This expected number goes to zero for $c > 1/2$ and the first-moment method implies that,
almost surely, there are no components of size between 2 and $n/2$. This completes the
proof of Theorem 8.11. ∎
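
The convexity argument can be sanity-checked by evaluating $f(k) = \ln n + k + k\ln\ln n - 2\ln k + k\ln c - ck\ln n + ck^2\frac{\ln n}{n}$ over the whole range. This is only a numerical sketch for one choice of parameters ($n = 10^4$, $c = 1$, both picked here for illustration); the function name `f` mirrors the text.

```python
from math import log

def f(k, n, c):
    """Exponent in the bound E(x_k) <= exp(f(k)) for p = c ln(n) / n."""
    ln_n = log(n)
    return (ln_n + k + k * log(ln_n) - 2 * log(k) + k * log(c)
            - c * k * ln_n + c * k * k * ln_n / n)

if __name__ == "__main__":
    n, c = 10_000, 1.0
    values = {k: f(k, n, c) for k in range(2, n // 2 + 1)}
    k_max = max(values, key=values.get)
    print("maximizer:", k_max)
    print("f(2) =", values[2], " (1 - 2c) ln n =", (1 - 2 * c) * log(n))
    print("f(n/2) =", values[n // 2], " -c (n/4) ln n =", -c * n / 4 * log(n))
```

At moderate $n$ the value $f(2)$ differs from $(1-2c)\ln n$ by lower-order terms of size $O(\ln\ln n)$, so the two printed numbers agree only to leading order.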


8.4.3 Threshold for O(ln n) Diameter


We now show that within a constant factor of the threshold for graph connectivity, not
only is the graph connected, but its diameter is $O(\ln n)$. That is, if $p \ge c\frac{\ln n}{n}$ for sufficiently
large constant $c$, the diameter of $G(n,p)$ is $O(\ln n)$ with high probability.


Consider a particular vertex $v$. Let $S_i$ be the set of vertices at distance $i$ from $v$. We
argue that as $i$ increases, with high probability $|S_1| + |S_2| + \cdots + |S_i|$ grows by at least a
factor of two, up to a size of $n/1000$. This implies that in $O(\ln n)$ steps, at least $n/1000$
vertices are connected to $v$. Then, there is a simple argument at the end of the proof of
Theorem 8.13 that a pair of $n/1000$ sized subsets, connected to two different vertices $v$
and $w$, have an edge between them with high probability.

Lemma 8.12 Consider $G(n, p)$ for sufficiently large $n$ with $p = c\frac{\ln n}{n}$ for any $c > 0$. Let
$S_i$ be the set of vertices at distance $i$ from some fixed vertex $v$. If $|S_1| + |S_2| + \cdots + |S_i| \le
n/1000$, then

$$\text{Prob}\left(|S_{i+1}| < 2(|S_1| + |S_2| + \cdots + |S_i|)\right) \le e^{-10|S_i|}.$$


Proof: Let $|S_i| = k$. For each vertex $u$ not in $S_1 \cup S_2 \cup \ldots \cup S_i$, the probability that
$u$ is not in $S_{i+1}$ is $(1 - p)^k$ and these events are independent. So, $|S_{i+1}|$ is the sum of
$n - (|S_1| + |S_2| + \cdots + |S_i|)$ independent Bernoulli random variables, each with probability
of

$$1 - (1 - p)^k \ge 1 - e^{-ck \ln n / n}$$

of being one. Note that $n - (|S_1| + |S_2| + \cdots + |S_i|) \ge 999n/1000$. So,

$$E(|S_{i+1}|) \ge \frac{999n}{1000}\left(1 - e^{-ck \frac{\ln n}{n}}\right).$$

Subtracting $200k$ from each side

$$E(|S_{i+1}|) - 200k \ge \frac{n}{2}\left(1 - e^{-ck \frac{\ln n}{n}} - 400\frac{k}{n}\right).$$

Let $\alpha = \frac{k}{n}$ and $f(\alpha) = 1 - e^{-c\alpha \ln n} - 400\alpha$. By differentiation $f''(\alpha) \le 0$, so $f$ is concave
and the minimum value of $f$ over the interval $[0, 1/1000]$ is attained at one of the end
points. It is easy to check that both $f(0)$ and $f(1/1000)$ are greater than or equal to
zero for sufficiently large $n$. Thus, $f$ is non-negative throughout the interval proving that
$E(|S_{i+1}|) \ge 200|S_i|$. The lemma follows from Chernoff bounds. ∎


Theorem 8.13 For $p \ge c \ln n / n$, where $c$ is a sufficiently large constant, almost surely,
$G(n,p)$ has diameter $O(\ln n)$.


Proof: By Corollary 8.2, almost surely, the degree of every vertex is $\Omega(np) = \Omega(\ln n)$,
which is at least $20 \ln n$ for $c$ sufficiently large. Assume that this holds. So, for a fixed
vertex $v$, $S_1$ as defined in Lemma 8.12 satisfies $|S_1| \ge 20 \ln n$.


Let $i_0$ be the least $i$ such that $|S_1| + |S_2| + \cdots + |S_i| > n/1000$. From Lemma 8.12 and the
union bound, the probability that for some $i$, $1 \le i \le i_0 - 1$, $|S_{i+1}| < 2(|S_1| + |S_2| + \cdots + |S_i|)$
is at most $\sum_{k=20\ln n}^{n/1000} e^{-10k} \le 1/n^4$. So, with probability at least $1 - (1/n^4)$, each $S_{i+1}$ is
at least double the sum of the previous $S_j$'s, which implies that in $O(\ln n)$ steps, $i_0 + 1$
is reached.

Consider any other vertex $w$. We wish to find a short $O(\ln n)$ length path between
$v$ and $w$. By the same argument as above, the number of vertices at distance $O(\ln n)$
from $w$ is at least $n/1000$. To complete the argument, either these two sets intersect, in
which case we have found a path from $v$ to $w$ of length $O(\ln n)$, or they do not intersect.
In the latter case, with high probability there is some edge between them. For a pair of
disjoint sets of size at least $n/1000$, the probability that none of the possible $n^2/10^6$ or
more edges between them is present is at most $(1 - p)^{n^2/10^6} = e^{-\Omega(n \ln n)}$. There are at most
$2^{2n}$ pairs of such sets and so the probability that there is some such pair with no edges
is $e^{-\Omega(n \ln n) + O(n)} \to 0$. Note that there is no conditioning problem since we are arguing
this for every pair of such sets. Think of whether such an argument made for just the $n$
subsets of vertices, which are vertices at distance at most $O(\ln n)$ from a specific vertex,
would work. ∎
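
The growth of the sets $S_i$ can be observed directly with a breadth first search. The sketch below is an illustration only: the helper names `random_graph` and `layer_sizes` are introduced here, and the parameters in the demo ($n = 2000$, $c = 5$) are arbitrary choices.

```python
import math
import random

def random_graph(n, p, seed=0):
    """Adjacency lists of a G(n, p) sample."""
    rng = random.Random(seed)
    adj = [[] for _ in range(n)]
    for u in range(n):
        for w in range(u + 1, n):
            if rng.random() < p:
                adj[u].append(w)
                adj[w].append(u)
    return adj

def layer_sizes(adj, v):
    """Return [|S_1|, |S_2|, ...], where S_i is the set of vertices at distance i from v."""
    seen = {v}
    layer = [v]
    sizes = []
    while True:
        next_layer = []
        for u in layer:
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    next_layer.append(w)
        if not next_layer:
            return sizes
        sizes.append(len(next_layer))
        layer = next_layer

if __name__ == "__main__":
    n, c = 2000, 5.0
    adj = random_graph(n, c * math.log(n) / n)
    print(layer_sizes(adj, 0))
```

The number of entries in the returned list is the eccentricity of $v$ within its component, so taking the maximum list length over all start vertices of a connected graph gives its diameter.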


8.5 Phase Transitions for Increasing Properties 


For many graph properties such as connectivity, having no isolated vertices, having a 
cycle, etc., the probability of a graph having the property increases as edges are added to 
the graph. Such a property is called an increasing property. Q is an increasing property
of graphs if when a graph G has the property, any graph obtained by adding edges to G
must also have the property. In this section we show that any increasing property has a 
threshold, although not necessarily a sharp one. 


The notion of increasing property is defined in terms of adding edges. The following 
intuitive lemma proves that if Q is an increasing property, then increasing p in G (n, p) 
increases the probability of the property Q. 


Lemma 8.14 If Q is an increasing property of graphs and $0 \le p \le q \le 1$, then the
probability that G(n,q) has property Q is greater than or equal to the probability that 
G(n,p) has property Q. 


Proof: This proof uses an interesting relationship between $G(n, p)$ and $G(n, q)$. Generate
$G(n, q)$ as follows. First generate $G(n, p)$. This means generating a graph on $n$ vertices
with edge probabilities $p$. Then, independently generate another graph $G\left(n, \frac{q-p}{1-p}\right)$ and
take the union by including an edge if either of the two graphs has the edge. Call the
resulting graph $H$. The graph $H$ has the same distribution as $G(n, q)$. This follows since
the probability that an edge is in $H$ is $p + (1 - p)\frac{q-p}{1-p} = q$, and, clearly, the edges of $H$ are
independent. The lemma follows since whenever $G(n, p)$ has the property Q, $H$ also has
the property Q. ∎
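
The coupling used in the proof can be written out directly. This is a minimal sketch assuming $0 \le p \le q \le 1$ and $p < 1$; the function names `extra_edge_probability` and `coupled_graphs` are introduced here for illustration.

```python
import random
from itertools import combinations

def extra_edge_probability(p, q):
    """Edge probability (q - p) / (1 - p) of the second, independent graph."""
    return (q - p) / (1.0 - p)

def coupled_graphs(n, p, q, seed=0):
    """Return edge sets (G, H) with G ~ G(n, p), H ~ G(n, q), and G a subgraph of H."""
    rng = random.Random(seed)
    r = extra_edge_probability(p, q)
    g, h = set(), set()
    for edge in combinations(range(n), 2):
        in_g = rng.random() < p        # edge of G(n, p)
        in_extra = rng.random() < r    # edge of the independent G(n, (q-p)/(1-p))
        if in_g:
            g.add(edge)
        if in_g or in_extra:           # H is the union of the two graphs
            h.add(edge)
    return g, h

if __name__ == "__main__":
    g, h = coupled_graphs(100, 0.02, 0.05)
    print(len(g), len(h), g <= h)
```

Since every edge of G is also an edge of H, any increasing property that holds for G holds for H, which is exactly the step the lemma relies on.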


We now introduce a notion called replication. An $m$-fold replication of $G(n, p)$ is a
random graph obtained as follows. Generate $m$ independent copies of $G(n,p)$ on the
same set of vertices. Include an edge in the $m$-fold replication if the edge is in any one
of the $m$ copies of $G(n,p)$. The resulting random graph has the same distribution as
$G(n, q)$ where $q = 1 - (1 - p)^m$ since the probability that a particular edge is not in the
$m$-fold replication is the product of probabilities that it is not in any of the $m$ copies
of $G(n,p)$. If the $m$-fold replication of $G(n, p)$ does not have an increasing property Q,
then none of the $m$ copies of $G(n,p)$ has the property. The converse is not true. If no
copy has the property, their union may have it. Since Q is an increasing property and
$q = 1 - (1 - p)^m \le 1 - (1 - mp) = mp$,

$$\text{Prob}(G(n, mp) \text{ has } Q) \ge \text{Prob}(G(n, q) \text{ has } Q) \qquad (8.3)$$


Figure 8.10: The property that G has three or more edges is an increasing property.
Let H be the m-fold replication of G. If any copy of G has three or more edges, H has
three or more edges. However, H can have three or more edges even if no copy of G has
three or more edges.
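
The edge probability of the $m$-fold replication and the inequality $q \le mp$ behind (8.3) can be checked with a few lines. This is a small sketch; the function name `replication_edge_probability` is an assumption introduced here.

```python
def replication_edge_probability(p, m):
    """Edge probability q = 1 - (1 - p)^m of the m-fold replication of G(n, p)."""
    return 1.0 - (1.0 - p) ** m

if __name__ == "__main__":
    for p, m in [(0.001, 10), (0.01, 5), (0.1, 3), (0.5, 2)]:
        q = replication_edge_probability(p, m)
        print(f"p={p}, m={m}: q={q:.6f}, mp={m * p:.6f}")
```

The gap between $q$ and $mp$ is small when $mp$ is small, which is the regime used in the proof of the phase transition result.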


We now show that every increasing property Q has a phase transition. The transition
occurs at the point $p(n)$ at which the probability that $G(n, p(n))$ has property Q is $\frac{1}{2}$.
We will prove that for any function asymptotically less than $p(n)$, the probability of
having property Q goes to zero as $n$ goes to infinity.


Theorem 8.15 Each increasing property Q of $G(n,p)$ has a phase transition at $p(n)$,
where for each $n$, $p(n)$ is the minimum real number $a_n$ for which the probability that
$G(n, a_n)$ has property Q is 1/2.


Proof: Let $p_0(n)$ be any function such that

$$\lim_{n\to\infty} \frac{p_0(n)}{p(n)} = 0.$$

We assert that almost surely $G(n, p_0)$ does not have the property Q. Suppose for
contradiction, that this is not true. That is, the probability that $G(n, p_0)$ has the property
Q does not converge to zero. By the definition of a limit, there exists $\epsilon > 0$ for which
the probability that $G(n, p_0)$ has property Q is at least $\epsilon$ on an infinite set $I$ of $n$. Let
$m = \lceil 1/\epsilon \rceil$. Let $G(n, q)$ be the $m$-fold replication of $G(n, p_0)$. The probability that
$G(n, q)$ does not have Q is at most $(1 - \epsilon)^m \le e^{-1} \le 1/2$ for all $n \in I$. For these $n$, by
(8.3)

$$\text{Prob}(G(n, mp_0) \text{ has } Q) \ge \text{Prob}(G(n, q) \text{ has } Q) \ge 1/2.$$

Since $p(n)$ is the minimum real number $a_n$ for which the probability that $G(n, a_n)$ has
property Q is 1/2, it must be that $mp_0(n) \ge p(n)$. This implies that $\frac{p_0(n)}{p(n)}$ is at least $1/m$
infinitely often, contradicting the hypothesis that $\lim_{n\to\infty} \frac{p_0(n)}{p(n)} = 0$.

A symmetric argument shows that for any $p_1(n)$ such that $\lim_{n\to\infty} \frac{p(n)}{p_1(n)} = 0$, $G(n, p_1)$
almost surely has property Q. ∎


8.6 Branching Processes 


A branching process is a method for creating a random tree. Starting with the root 
node, each node has a probability distribution for the number of its children. The root of 
the tree is a parent and its descendants are the children with their descendants being the 
grandchildren. The children of the root are the first generation, their children the second 
generation, and so on. Branching processes have obvious applications in population studies.


We analyze a simple case of a branching process where the distribution of the number 
of children at each node in the tree is the same. The basic question asked is what is the 
probability that the tree is finite, i.e., the probability that the branching process dies out? 
This is called the extinction probability. 


Our analysis of the branching process will give the probability of extinction, as well 
as the expected size of the components conditioned on extinction. 


An important tool in our analysis of branching processes is the generating function.
The generating function for a non-negative integer valued random variable $y$ is
$f(x) = \sum_{i=0}^{\infty} p_i x^i$ where $p_i$ is the probability that $y$ equals $i$. The reader not familiar with
generating functions should consult Section 12.8 of the appendix.


Let the random variable $z_j$ be the number of children in the $j$-th generation and let
$f_j(x)$ be the generating function for $z_j$. Then $f_1(x) = f(x)$ is the generating function for
the first generation where $f(x)$ is the generating function for the number of children at a
node in the tree. The generating function for the 2nd generation is $f_2(x) = f(f(x))$. In
general, the generating function for the $(j+1)$-st generation is given by $f_{j+1}(x) = f_j(f(x))$.
To see this, observe two things.


First, the generating function for the sum of two independent, identically distributed
integer valued random variables $x_1$ and $x_2$ is the square of their generating function

$$f^2(x) = p_0^2 + (p_0 p_1 + p_1 p_0)x + (p_0 p_2 + p_1 p_1 + p_2 p_0)x^2 + \cdots.$$

For $x_1 + x_2$ to have value zero, both $x_1$ and $x_2$ must have value zero, for $x_1 + x_2$ to have
value one, exactly one of $x_1$ or $x_2$ must have value zero and the other have value one, and
so on. In general, the generating function for the sum of $i$ independent random variables,
each with generating function $f(x)$, is $f^i(x)$.
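
In terms of coefficient lists, multiplying generating functions is a convolution of the two probability vectors. The sketch below makes this concrete; the function name `pmf_of_sum` and the example distribution are assumptions introduced here.

```python
def pmf_of_sum(p, q):
    """Distribution of x1 + x2 for independent x1 ~ p and x2 ~ q.

    p[i] and q[i] are the probabilities of the value i. The result lists the
    coefficients of the product of the two generating functions.
    """
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

if __name__ == "__main__":
    p = [0.25, 0.25, 0.5]       # hypothetical child distribution: p_0, p_1, p_2
    print(pmf_of_sum(p, p))     # coefficients of f(x)^2
```

The coefficient of $x^2$ in the output is $p_0 p_2 + p_1 p_1 + p_2 p_0$, matching the expansion of $f^2(x)$ above.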


The second observation is that the coefficient of $x^i$ in $f_j(x)$ is the probability of
there being $i$ children in the $j$-th generation. If there are $i$ children in the $j$-th generation,
the number of children in the $(j+1)$-st generation is the sum of $i$ independent random
variables each with generating function $f(x)$. Thus, the generating function for the $(j+1)$-st
generation, given $i$ children in the $j$-th generation, is $f^i(x)$. The generating function for
the $(j+1)$-st generation is given by

$$f_{j+1}(x) = \sum_{i=0}^{\infty} \text{Prob}(z_j = i)\, f^i(x).$$

If $f_j(x) = \sum_{i=0}^{\infty} a_i x^i$, then $f_{j+1}$ is obtained by substituting $f(x)$ for $x$ in $f_j(x)$.

Since $f(x)$ and its iterates, $f_2, f_3, \ldots$, are all polynomials in $x$ with non-negative
coefficients, $f(x)$ and its iterates are all monotonically increasing and convex on the unit
interval. Since the probabilities of the number of children of a node sum to one, if $p_0 < 1$,
some coefficient of $x$ to a power other than zero in $f(x)$ is non-zero and $f(x)$ is strictly
increasing.


Let $q$ be the probability that the branching process dies out. If there are $i$ children
in the first generation, then each of the $i$ subtrees must die out and this occurs with
probability $q^i$. Thus, $q$ equals the summation over all values of $i$ of the product of the
probability of $i$ children times the probability that $i$ subtrees will die out. This gives
$q = \sum_{i=0}^{\infty} p_i q^i$. Thus, $q$ is a root of $x = \sum_{i=0}^{\infty} p_i x^i$, that is $x = f(x)$.


This suggests focusing on roots of the equation $f(x) = x$ in the interval [0,1]. The value
$x = 1$ is always a root of the equation $f(x) = x$ since $f(1) = \sum_{i=0}^{\infty} p_i = 1$. When is there a
smaller non-negative root? The derivative of $f(x)$ at $x = 1$ is $f'(1) = p_1 + 2p_2 + 3p_3 + \cdots$.
Let $m = f'(1)$. Thus, $m$ is the expected number of children of a node. If $m > 1$, one
might expect the tree to grow forever, since each node at time $j$ is expected to have more
than one child. But this does not imply that the probability of extinction is zero. In fact,
if $p_0 > 0$, then with positive probability, the root will have no children and the process
will become extinct right away. Recall that for $G(n, \frac{d}{n})$, the expected number of children
is $d$, so the parameter $m$ plays the role of $d$.


Figure 8.11: Illustration of the root of equation f(x) = x in the interval [0,1).
(Recoverable plot labels: $m > 1$; $m = 1$ and $p_1 < 1$; $p_0$.)


If $m < 1$, then the slope of $f(x)$ at $x = 1$ is less than one. This fact along with
convexity of $f(x)$ implies that $f(x) > x$ for $x$ in $[0, 1)$ and there is no root of $f(x) = x$ in
the interval $[0, 1)$.


If $m = 1$ and $p_1 < 1$, then once again convexity implies that $f(x) > x$ for $x \in [0,1)$
and there is no root of $f(x) = x$ in the interval $[0, 1)$. If $m = 1$ and $p_1 = 1$, then $f(x)$ is
the straight line $f(x) = x$.


If $m > 1$, then the slope of $f(x)$ is greater than the slope of $x$ at $x = 1$. This fact,
along with convexity of $f(x)$, implies $f(x) = x$ has a unique root in $[0,1)$. When $p_0 = 0$,
the root is at $x = 0$.


Let $q$ be the smallest non-negative root of the equation $f(x) = x$. For $m < 1$ and for
$m = 1$ and $p_1 < 1$, $q$ equals one and for $m > 1$, $q$ is strictly less than one. We shall see
that the value of $q$ is the extinction probability of the branching process and that $1 - q$ is
the immortality probability. That is, $q$ is the probability that for some $j$, the number of
children in the $j$-th generation is zero. To see this, note that for $m > 1$, $\lim_{j\to\infty} f_j(x) = q$ for
$0 \le x < 1$. Figure 8.12 illustrates the proof which is given in Lemma 8.16. Similarly note
that when $m < 1$ or $m = 1$ with $p_1 < 1$, $f_j(x)$ approaches one as $j$ approaches infinity.


Lemma 8.16 Assume $m > 1$. Let $q$ be the unique root of $f(x) = x$ in $[0,1)$. In the limit as
$j$ goes to infinity, $f_j(x) = q$ for $x$ in $[0, 1)$.


Proof: If $0 \le x \le q$, then $x \le f(x) \le f(q)$ and iterating this inequality

$$x \le f_1(x) \le f_2(x) \le \cdots \le f_j(x) \le f(q) = q.$$

Clearly, the sequence converges and it must converge to a fixed point where $f(x) = x$.
Similarly, if $q \le x < 1$, then $f(q) \le f(x) \le x$ and iterating this inequality

$$x \ge f_1(x) \ge f_2(x) \ge \cdots \ge f_j(x) \ge f(q) = q.$$

In the limit as $j$ goes to infinity $f_j(x) = q$ for all $x$, $0 \le x < 1$. That is

$$\lim_{j\to\infty} f_j(x) = q + 0x + 0x^2 + \cdots$$

and there are no children with probability $q$, while every finite non-zero number of children
has probability zero. ∎


Figure 8.12: Illustration of convergence of the sequence of iterations $f_1(x), f_2(x), \ldots$ to
$q$.


Recall that $f_j(x)$ is the generating function $\sum_{i=0}^{\infty} \text{Prob}(z_j = i)\, x^i$. The fact that in the
limit the generating function equals the constant $q$, and is not a function of $x$, says that
Prob$(z_j = 0) = q$ and Prob$(z_j = i) = 0$ for all finite non-zero values of $i$. The remaining
probability is the probability of a non-finite component. Thus, when $m > 1$, $q$ is the
extinction probability and $1 - q$ is the probability that $z_j$ grows without bound.
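
Since $f_j(0) = \text{Prob}(z_j = 0)$ and the iterates converge to $q$, the extinction probability can be approximated by repeatedly applying $f$ starting from $x = 0$. The following is a minimal sketch for a child distribution with finite support; the function names, the tolerance, and the example distribution are assumptions introduced here.

```python
def generating_function(p, x):
    """f(x) = sum_i p[i] * x^i for a child distribution p = [p_0, p_1, ...]."""
    return sum(pi * x ** i for i, pi in enumerate(p))

def extinction_probability(p, tol=1e-12, max_iter=100_000):
    """Approximate the smallest non-negative root of f(x) = x by iterating f from 0."""
    q = 0.0
    for _ in range(max_iter):
        nxt = generating_function(p, q)   # f_{j+1}(0) = f(f_j(0))
        if abs(nxt - q) < tol:
            return nxt
        q = nxt
    return q

if __name__ == "__main__":
    p = [0.25, 0.25, 0.5]                 # hypothetical distribution with m = 1.25
    m = sum(i * pi for i, pi in enumerate(p))
    print("m =", m, " q =", extinction_probability(p))
```

For this example $f(x) = 0.25 + 0.25x + 0.5x^2$, and $f(x) = x$ has roots $x = 1/2$ and $x = 1$, so the iteration should approach $1/2$. In the critical case $m = 1$ the iteration converges only slowly, so the tolerance-based stopping rule is less reliable there.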


Theorem 8.17 Consider a tree generated by a branching process. Let f(x) be the gener- 
ating function for the number of children at each node. 


1. If the expected number of children at each node is less than or equal to one, then the 
probability of extinction is one unless the probability of exactly one child is one. 


2. If the expected number of children of each node is greater than one, then the proba- 
bility of extinction is the unique solution to f(a) = x in [0, 1). 
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Proof: Let p; be the probability of i children at each node. Then f(x) = po + pix + 
pax? +--+ is the generating function for the number of children at each node and f'(1) = 
pı + 2p2 + 3p3 + --- is the slope of f(x) at x = 1. Observe that f'(1) is the expected 
number of children at each node. 





Since the expected number of children at each node is the slope of $f(x)$ at $x = 1$, if
the expected number of children is less than or equal to one, the slope of $f(x)$ at $x = 1$
is less than or equal to one and the unique root of $f(x) = x$ in $(0, 1]$ is at $x = 1$ and the
probability of extinction is one unless $f'(1) = 1$ and $p_1 = 1$. If $f'(1) = 1$ and $p_1 = 1$,
$f(x) = x$ and the tree is an infinite degree one chain. If the slope of $f(x)$ at $x = 1$ is
greater than one, then the probability of extinction is the unique solution to $f(x) = x$ in
$[0, 1)$. ■
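
The theorem can be checked numerically. The sketch below is our own illustration, not part of the proof: it iterates $x \leftarrow f(x)$ starting from $x = 0$, using the fact that $f_j(0)$ is the probability of extinction by generation $j$. The two offspring distributions and the function name are hypothetical choices made only for the example.

```python
def extinction_probability(p, iterations=10_000, tol=1e-12):
    """Iterate x <- f(x) from x = 0, where f(x) = sum_i p[i] * x**i.

    p[i] is the probability of i children at a node.
    """
    def f(x):
        return sum(p_i * x ** i for i, p_i in enumerate(p))

    x = 0.0
    for _ in range(iterations):
        nxt = f(x)
        if abs(nxt - x) < tol:
            return nxt
        x = nxt
    return x


# Hypothetical offspring distributions on {0, 1, 2} children.
supercritical = [0.25, 0.25, 0.5]  # expected children 1.25 > 1
subcritical = [0.5, 0.25, 0.25]    # expected children 0.75 < 1

q_super = extinction_probability(supercritical)  # root of f(x) = x in [0, 1)
q_sub = extinction_probability(subcritical)      # extinction is certain
```

For the first distribution $f(x) = x$ reduces to $2x^2 - 3x + 1 = 0$, whose root in $[0,1)$ is $1/2$; for the second the only root in $[0,1]$ is $x = 1$.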


A branching process can be viewed as the process of creating a component in an infi- 
nite graph. In a finite graph, the probability distribution of descendants is not a constant 
as more and more vertices of the graph get discovered. 


The simple branching process defined here either dies out or goes to infinity. In bio- 
logical systems there are other factors, since processes often go to stable populations. One 
possibility is that the probability distribution for the number of descendants of a child 
depends on the total population of the current generation. 


Expected size of extinct families 


We now show that the expected size of an extinct family is finite, provided that $m \ne 1$.
Note that at extinction, the size must be finite. However, the expected size at extinction
could conceivably be infinite, if the probability of dying out did not decay fast enough. For
example, suppose that with probability $\frac{1}{2}$ it became extinct with size 3, with probability
$\frac{1}{4}$ it became extinct with size 9, with probability $\frac{1}{8}$ it became extinct with size 27, etc. In
such a case the expected size at extinction would be infinite even though the process dies
out with probability one. We now show this does not happen.


Lemma 8.18 If the slope $m = f'(1)$ does not equal one, then the expected size of an
extinct family is finite. If the slope $m$ equals one and $p_1 = 1$, then the tree is an infinite
degree one chain and there are no extinct families. If $m = 1$ and $p_1 < 1$, then the expected
size of the extinct family is infinite.


Proof: Let $z_i$ be the random variable denoting the size of the $i^{th}$ generation and let $q$ be
the probability of extinction. The probability of extinction for a tree with $k$ children in
the first generation is $q^k$ since each of the $k$ children has an extinction probability of $q$.
Note that the expected size of $z_1$, the first generation, over extinct trees will be smaller
than the expected size of $z_1$ over all trees since when the root node has a larger number
of children than average, the tree is more likely to be infinite.

By Bayes rule

$$\text{Prob}(z_1 = k \mid \text{extinction}) = \text{Prob}(z_1 = k)\,\frac{\text{Prob}(\text{extinction} \mid z_1 = k)}{\text{Prob}(\text{extinction})} = p_k \frac{q^k}{q} = p_k q^{k-1}.$$


Knowing the probability distribution of $z_1$ given extinction allows us to calculate the
expected size of $z_1$ given extinction.

$$E(z_1 \mid \text{extinction}) = \sum_{k=0}^{\infty} k p_k q^{k-1} = f'(q).$$

We now prove, using independence, that the expected size of the $i^{th}$ generation given
extinction is

$$E(z_i \mid \text{extinction}) = \left(f'(q)\right)^i.$$


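For $i = 2$, $z_2$ is the sum of $z_1$ independent random variables, each independent of the
random variable $z_1$. So, $E(z_2 \mid z_1 = j \text{ and extinction}) = E(\text{sum of } j \text{ copies of } z_1 \mid \text{extinction}) =
jE(z_1 \mid \text{extinction})$. Summing over all values of $j$

$$\begin{aligned}
E(z_2 \mid \text{extinction}) &= \sum_{j=1}^{\infty} E(z_2 \mid z_1 = j \text{ and extinction})\,\text{Prob}(z_1 = j \mid \text{extinction}) \\
&= \sum_{j=1}^{\infty} jE(z_1 \mid \text{extinction})\,\text{Prob}(z_1 = j \mid \text{extinction}) \\
&= E(z_1 \mid \text{extinction}) \sum_{j=1}^{\infty} j\,\text{Prob}(z_1 = j \mid \text{extinction}) = E^2(z_1 \mid \text{extinction}).
\end{aligned}$$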


Since $E(z_1 \mid \text{extinction}) = f'(q)$, $E(z_2 \mid \text{extinction}) = (f'(q))^2$. Similarly, $E(z_i \mid \text{extinction}) =
(f'(q))^i$. The expected size of the tree is the sum of the expected sizes of each generation.
That is,

$$\text{Expected size of tree given extinction} = \sum_{i=0}^{\infty} E(z_i \mid \text{extinction}) = \sum_{i=0}^{\infty} \left(f'(q)\right)^i = \frac{1}{1 - f'(q)}.$$


Thus, the expected size of an extinct family is finite since $f'(q) < 1$ provided $m \ne 1$.


The fact that $f'(q) < 1$ is illustrated in Figure 8.11. If $m < 1$, then $q = 1$ and $f'(q) = m$
is less than one. If $m > 1$, then $q \in [0, 1)$ and again $f'(q) < 1$ since $q$ is the solution to
$f(x) = x$ and $f'(q)$ must be less than one for the curve $f(x)$ to cross the line $y = x$. Thus,
for $m < 1$ or $m > 1$, $f'(q) < 1$ and the expected tree size of $\frac{1}{1 - f'(q)}$ is finite. For $m = 1$ and
$p_1 < 1$, one has $q = 1$ and thus $f'(q) = 1$ and the formula for the expected size of the tree
diverges. ■
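
As a small worked instance of the lemma, the sketch below evaluates the conditional distribution $p_k q^{k-1}$, its mean $f'(q)$, and the expected size $1/(1 - f'(q))$ for one offspring distribution. The distribution is a hypothetical example of ours, and $q = 1/2$ is the root of $f(x) = x$ in $[0,1)$ for that particular choice.

```python
def conditioned_offspring(p, q):
    """Prob(z_1 = k | extinction) = p_k * q**(k - 1), by Bayes' rule (needs q > 0)."""
    return [p_k * q ** (k - 1) for k, p_k in enumerate(p)]


def f_prime(p, x):
    """Derivative of the generating function f(x) = sum_k p[k] * x**k."""
    return sum(k * p_k * x ** (k - 1) for k, p_k in enumerate(p) if k > 0)


# Hypothetical offspring distribution with mean 1.25 > 1.
p = [0.25, 0.25, 0.5]
q = 0.5  # solves 0.25 + 0.25 x + 0.5 x^2 = x in [0, 1)

cond = conditioned_offspring(p, q)
mean_given_extinction = sum(k * c for k, c in enumerate(cond))
expected_size = 1.0 / (1.0 - f_prime(p, q))
```

Here the conditioned distribution sums to one, its mean agrees with $f'(q) = 0.75$, and the expected size of an extinct family is $1/(1 - 0.75) = 4$.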




8.7 CNF-SAT 


Phase transitions occur not only in random graphs, but in other random structures 
as well. An important example is that of satisfiability of Boolean formulas in conjunctive 
normal form. A conjunctive normal form (CNF) formula over $n$ variables $x_1, \ldots, x_n$ is
an AND of ORs of literals, where a literal is a variable or its negation. For example, the
following is a CNF formula over the variables $\{x_1, x_2, x_3, x_4\}$:


$$(x_1 \vee \bar{x}_2 \vee x_3)(x_2 \vee \bar{x}_4)(x_1 \vee x_4)(x_3 \vee x_4)(x_2 \vee \bar{x}_3 \vee x_4).$$


Each OR of literals is called a clause; for example, the above formula has five clauses. A 
k-CNF formula is a CNF formula in which each clause has size at most k, so the above 
formula is a 3-CNF formula. An assignment of true/false values to variables is said to 
satisfy a CNF formula if it satisfies every clause in it. Setting all variables to true satisfies 
the above CNF formula, and in fact this formula has multiple satisfying assignments. A 
formula is said to be satisfiable if there exists at least one assignment of truth values to
variables that satisfies it.
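
These definitions translate directly into code. The sketch below is our illustration and uses a signed-integer encoding of literals (+i for $x_i$, -i for its negation), which is an assumption of the example rather than notation from the text. The formula is the five-clause example read as $(x_1 \vee \bar{x}_2 \vee x_3)(x_2 \vee \bar{x}_4)(x_1 \vee x_4)(x_3 \vee x_4)(x_2 \vee \bar{x}_3 \vee x_4)$; the placement of the negations is likewise an assumption.

```python
from itertools import product

# Literals are nonzero ints: +i means x_i, -i means the negation of x_i.
formula = [[1, -2, 3], [2, -4], [1, 4], [3, 4], [2, -3, 4]]


def satisfies(assignment, clauses):
    """assignment maps a variable index to True/False; every clause needs a true literal."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )


all_true = {i: True for i in range(1, 5)}

# Brute force over all 2^4 assignments to count the satisfying ones.
n_models = sum(
    satisfies(dict(zip(range(1, 5), bits)), formula)
    for bits in product([False, True], repeat=4)
)
```

Brute force is only feasible for a handful of variables, which is why the solvers discussed next matter.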


Many important problems can be converted into questions of finding satisfying as- 
signments of CNF formulas. Indeed, the CNF-SAT problem of whether a given CNF 
formula is satisfiable is NP-Complete, meaning that any problem in the class NP can be 
converted into it. As a result, it is believed to be highly unlikely that there will ever 
exist an efficient algorithm for worst-case instances. However, there are solvers that turn 
out to work very well in practice on instances arising from a wide range of applications. 
There is also substantial structure and understanding of the satisfiability of random CNF 
formulas. The next two sections discuss each in turn. 


8.7.1 SAT-solvers in practice 


While the SAT problem is NP-complete, a number of algorithms have been developed 
that perform extremely well in practice on SAT formulas arising in a range of applica- 
tions. Such applications include hardware and software verification, creating action plans 
for robots and robot teams, solving combinatorial puzzles, and even proving mathematical 
theorems. 


Broadly, there are two classes of solvers: complete solvers and incomplete solvers. Com- 
plete solvers are guaranteed to find a satisfying assignment whenever one exists; if they 
do not return a solution, then you know the formula is not satisfiable. Complete solvers 
are often based on some form of recursive tree search. Incomplete solvers instead make a 
“best effort”; they are typically based on some local-search heuristic, and they may fail 
to output a solution even when a formula is satisfiable. However, they are typically much 
faster than complete solvers. 


An example of a complete solver is the following DPLL (Davis-Putnam-Logemann-
Loveland) style procedure. First, if there are any variables $x_i$ that never appear in negated
form in any clause, then set those variables to true and delete clauses where the literal $x_i$
appears. Similarly, if there are any $x_i$ that only appear in negated form, then set those
variables to false and delete clauses where the literal $\bar{x}_i$ appears. Second, if there are
any clauses that have only one literal in them (such clauses are called unit clauses), then
set that literal as needed to satisfy the clause. E.g., if the clause was “$(\bar{x}_3)$” then one
would set $x_3$ to false. Then remove that clause along with any other clause containing
that literal, and shrink any clause containing the negation of that literal (e.g., a clause
such as $(x_3 \vee x_4)$ would now become just $(x_4)$, and one would then run this rule again
on this clause). Finally, if neither of the above two cases applies, then one chooses some
literal and recursively tries both settings for it. Specifically, choose some literal $\ell$ and
recursively check if the formula is satisfiable conditioned on setting $\ell$ to true; if the answer
is “yes” then we are done, but if the answer is “no” then recursively check if the formula
is satisfiable conditioned on setting $\ell$ to false. Notice that this procedure is guaranteed to
find a satisfying assignment whenever one exists.
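
The three rules above can be sketched as follows. This is a minimal illustration under our own assumptions, not a production solver: literals are encoded as signed integers (+i for $x_i$, -i for its negation), the branching literal is simply the first literal of the first clause, and the function names are ours.

```python
def simplify(clauses, lit):
    """Set lit to true: drop clauses containing lit, remove -lit from the others."""
    return [[l for l in c if l != -lit] for c in clauses if lit not in c]


def dpll(clauses, assignment=None):
    """Return a dict {variable: bool} satisfying all clauses, or None if none exists.

    Variables missing from the returned dict may be set either way.
    """
    assignment = dict(assignment or {})
    while True:
        if any(len(c) == 0 for c in clauses):
            return None                      # an empty clause cannot be satisfied
        if not clauses:
            return assignment                # every clause has been satisfied
        lits = {l for c in clauses for l in c}
        pure = next((l for l in lits if -l not in lits), None)      # rule 1
        unit = next((c[0] for c in clauses if len(c) == 1), None)   # rule 2
        lit = pure if pure is not None else unit
        if lit is None:
            break
        assignment[abs(lit)] = lit > 0
        clauses = simplify(clauses, lit)
    lit = clauses[0][0]                      # rule 3: branch on some literal
    for choice in (lit, -lit):
        result = dpll(simplify(clauses, choice),
                      {**assignment, abs(choice): choice > 0})
        if result is not None:
            return result
    return None
```

Practical solvers add clause learning and careful branching heuristics on top of this skeleton; none of that is attempted here.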


An example of an incomplete solver is the following local-search procedure called 
Walksat. Walksat begins with a random assignment of truth-values to variables. If this 
happens to satisfy the formula, then it outputs success. If not, then it chooses some 
unsatisfied clause $C$ at random. If $C$ contains some variable $x_i$ whose truth-value can
be flipped (causing $C$ to be satisfied) without causing any other clause to be unsatisfied,
then $x_i$'s truth-value is flipped. Otherwise, Walksat either (a) flips the truth-value of the
variable in $C$ that causes the fewest other clauses to become unsatisfied, or else (b) flips
the truth-value of a random $x_i$ in $C$; the choice of whether to perform (a) or (b) is
determined by flipping a coin of bias $p$. Thus, Walksat is performing a kind of random walk
in the space of truth-assignments, hence the name. Walksat also has two time-thresholds
$T_{\text{flips}}$ and $T_{\text{restarts}}$. If the above procedure has not found a satisfying assignment after
$T_{\text{flips}}$ flips, it then restarts with a fresh initial random assignment and tries again; if that
entire process has not found a satisfying assignment after $T_{\text{restarts}}$ restarts, then it outputs
“no assignment found”.
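
A compact sketch of this procedure follows. It is our illustration of the description above, with several assumptions: literals are signed integers, every clause is non-empty, the coin comes up (a) with probability `p` (the text does not say which side of the coin is which), and the parameter names and defaults are ours.

```python
import random


def walksat(clauses, n_vars, p=0.5, t_flips=1000, t_restarts=10, seed=0):
    """Return a satisfying assignment {variable: bool}, or None if none was found."""
    rng = random.Random(seed)

    def satisfied(clause, a):
        return any(a[abs(l)] == (l > 0) for l in clause)

    def break_count(v, a):
        """Number of currently satisfied clauses that flipping v would falsify."""
        was_sat = [c for c in clauses if satisfied(c, a)]
        a[v] = not a[v]
        broken = sum(not satisfied(c, a) for c in was_sat)
        a[v] = not a[v]
        return broken

    for _ in range(t_restarts):
        a = {v: rng.random() < 0.5 for v in range(1, n_vars + 1)}
        for _ in range(t_flips):
            unsat = [c for c in clauses if not satisfied(c, a)]
            if not unsat:
                return a
            clause = rng.choice(unsat)
            variables = [abs(l) for l in clause]
            cost = {v: break_count(v, a) for v in variables}
            harmless = [v for v in variables if cost[v] == 0]
            if harmless:
                v = rng.choice(harmless)           # flip breaks no other clause
            elif rng.random() < p:
                v = min(variables, key=cost.get)   # option (a): fewest broken clauses
            else:
                v = rng.choice(variables)          # option (b): random variable in C
            a[v] = not a[v]
        if all(satisfied(c, a) for c in clauses):
            return a
    return None  # "no assignment found"; the formula may still be satisfiable
```

Because the solver is incomplete, a `None` result says nothing about unsatisfiability.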


The above solvers are just two simple examples. Due to the importance of the CNF- 
SAT problem, development of faster SAT-solvers is an active area of computer science 
research. SAT-solving competitions are held each year, and solvers are routinely being 
used to solve challenging verification, planning, and scheduling problems. 


8.7.2 Phase Transitions for CNF-SAT 


We now consider the question of phase transitions in the satisfiability of random k- 
CNF formulas. 


Generate a random CNF formula $f$ with $n$ variables, $m$ clauses, and $k$ literals per
clause, where recall that a literal is a variable or its negation. Specifically, each clause
in $f$ is selected independently at random from the set of all $\binom{n}{k} 2^k$ possible clauses of size
$k$. Equivalently, to generate a clause, choose a random set of $k$ distinct variables, and
then for each of those variables choose to either negate it or not with equal probability.
Here, the number of variables $n$ is going to infinity, $m$ is a function of $n$, and $k$ is
a fixed constant. A reasonable value to think of for $k$ is $k = 3$. Unsatisfiability is an
increasing property since adding more clauses preserves unsatisfiability. By arguments
similar to Section 8.5, there is a phase transition, i.e., a function $m(n)$ such that if $m_1(n)$
is $o(m(n))$, a random formula with $m_1(n)$ clauses is, almost surely, satisfiable and for
$m_2(n)$ with $m_2(n)/m(n) \to \infty$, a random formula with $m_2(n)$ clauses is, almost surely,
unsatisfiable. It has been conjectured that there is a constant $r_k$ independent of $n$ such
that $r_k n$ is a sharp threshold.


Here we derive upper and lower bounds on $r_k$. It is relatively easy to get an upper
bound on $r_k$. A fixed truth assignment satisfies a random $k$ clause with probability
$1 - \frac{1}{2^k}$ because of the $2^k$ truth assignments to the $k$ variables in the clause, only one
fails to satisfy the clause. Thus, with probability $\frac{1}{2^k}$ the clause is not satisfied, and with
probability $1 - \frac{1}{2^k}$ the clause is satisfied. Let $m = cn$. Now, $cn$ independent clauses are
all satisfied by the fixed assignment with probability $\left(1 - \frac{1}{2^k}\right)^{cn}$. Since there are $2^n$ truth
assignments, the expected number of satisfying assignments for a formula with $cn$ clauses
is $2^n \left(1 - \frac{1}{2^k}\right)^{cn}$. If $c = 2^k \ln 2$, the expected number of satisfying assignments is

$$2^n \left(1 - \frac{1}{2^k}\right)^{n 2^k \ln 2}.$$

$\left(1 - \frac{1}{2^k}\right)^{2^k}$ is at most $1/e$ and approaches $1/e$ in the limit. Thus,

$$2^n \left(1 - \frac{1}{2^k}\right)^{n 2^k \ln 2} \le 2^n e^{-n \ln 2} = 2^n 2^{-n} = 1.$$

For $c > 2^k \ln 2$, the expected number of satisfying assignments goes to zero as $n \to \infty$.
Here the expectation is over the choice of clauses which is random, not the choice of a
truth assignment. From the first moment method, it follows that a random formula with
$cn$ clauses is almost surely not satisfiable. Thus, $r_k \le 2^k \ln 2$.
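
The first-moment calculation is easy to evaluate numerically. The sketch below is our illustration; it works with the base-2 logarithm of the expectation $2^n(1 - 2^{-k})^{cn}$ so that large $n$ does not overflow, and the function name is ours.

```python
import math


def log2_expected_models(n, c, k=3):
    """log2 of 2**n * (1 - 2**-k)**(c*n), the expected number of satisfying assignments."""
    return n + c * n * math.log2(1.0 - 2.0 ** -k)


# The bound from the text for k = 3: 2^k ln 2 = 8 ln 2.
threshold = 2 ** 3 * math.log(2)

below = log2_expected_models(1000, 5)          # positive: expectation is huge
above = log2_expected_models(1000, 6)          # negative: expectation tends to zero
at_threshold = log2_expected_models(1000, threshold)
```

A large expectation below the bound does not by itself imply satisfiability; the first moment method only gives the upper bound on $r_k$.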


The other direction, showing a lower bound for $r_k$, is not that easy. From now on, we
focus only on the case $k = 3$. The statements and algorithms given here can be extended
to $k \ge 4$, but with different constants. It turns out that the second moment method
cannot be directly applied to get a lower bound on $r_3$ because the variance is too high. A
simple algorithm, called the Smallest Clause Heuristic (abbreviated SC), yields a satisfying
assignment with probability tending to one if $c < \frac{2}{3}$, proving that $r_3 \ge \frac{2}{3}$. Other
algorithms that are more difficult to analyze push the lower bound on $r_3$ higher.


The Smallest Clause Heuristic repeatedly executes the following. Assign true to a
random literal in a random shortest clause and delete the clause since it is now satisfied.
In more detail, pick at random a 1-literal clause, if one exists, and set that literal to
true. If there is no 1-literal clause, pick a 2-literal clause, select one of its two literals and
set the literal to true. Otherwise, pick a 3-literal clause and a literal in it and set the
literal to true. If we encounter a 0-length clause, then we have failed to find a satisfying
assignment; otherwise, we have found one.


A related heuristic, called the Unit Clause Heuristic, selects a random clause with one 
literal, if there is one, and sets the literal in it to true. Otherwise, it picks a random as 
yet unset literal and sets it to true. Another variation is the “pure literal” heuristic. It 
sets a random “pure literal”, a literal whose negation does not occur in any clause, to 
true, if there are any pure literals; otherwise, it sets a random literal to true. 


When a literal $w$ is set to true, all clauses containing $w$ are deleted, since they are
satisfied, and $\bar{w}$ is deleted from any clause containing $\bar{w}$. If a clause is reduced to length
zero (no literals), then the algorithm has failed to find a satisfying assignment to the 
formula. The formula may, in fact, be satisfiable, but the algorithm has failed. 
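
The Smallest Clause Heuristic and the random formula model can be sketched together as follows. This is our illustration with signed-integer literals; the generator follows the "choose $k$ distinct variables, then random signs" description for $k = 3$, and the sizes in the usage lines are arbitrary choices with $c = 0.5 < 2/3$.

```python
import random


def random_3cnf(n, m, rng):
    """m clauses, each on 3 distinct variables with independent random signs."""
    return [
        [v if rng.random() < 0.5 else -v for v in rng.sample(range(1, n + 1), 3)]
        for _ in range(m)
    ]


def smallest_clause(clauses, rng):
    """Run SC; return a partial assignment satisfying every clause, or None on failure."""
    clauses = [list(c) for c in clauses]
    assignment = {}
    while clauses:
        shortest = min(len(c) for c in clauses)
        if shortest == 0:
            return None  # a 0-length clause appeared: the heuristic has failed
        clause = rng.choice([c for c in clauses if len(c) == shortest])
        w = rng.choice(clause)
        assignment[abs(w)] = w > 0
        # Delete clauses containing w; delete the negation of w from the others.
        clauses = [[l for l in c if l != -w] for c in clauses if w not in c]
    return assignment


rng = random.Random(1)
formula = random_3cnf(200, 100, rng)   # n = 200 variables, cn = 100 clauses
result = smallest_clause(formula, rng)
```

A `None` result means only that the heuristic failed on this run; as noted above, the formula may still be satisfiable.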


Example: Consider a 3-CNF formula with $n$ variables and $cn$ clauses. With $n$ variables
there are $2n$ literals, since a variable and its complement are distinct literals. The expected
number of times a literal occurs is calculated as follows. Each clause has three literals.
Thus, each of the $2n$ different literals occurs $\frac{3cn}{2n} = \frac{3}{2}c$ times on average. Suppose $c = 5$.
Then each literal appears 7.5 times on average. If one sets a literal to true, one would
expect to satisfy 7.5 clauses. However, this process is not repeatable since after setting a
literal to true there is conditioning so that the formula is no longer random. ■





Theorem 8.19 If the number of clauses in a random 3-CNF formula grows as cn where 
c is a constant less than 2/3, then with probability 1 — o(1), the Shortest Clause (SC) 
Heuristic finds a satisfying assignment. 


The proof of this theorem will take the rest of the section. A general impediment to 
proving that simple algorithms work for random instances of many problems is condition- 
ing. At the start, the input is random and has properties enjoyed by random instances. 
But, as the algorithm is executed, the data is no longer random; it is conditioned on the 
steps of the algorithm so far. In the case of SC and other heuristics for finding a satisfying 
assignment for a Boolean formula, the argument to deal with conditioning is relatively 
simple. 


We supply some intuition before giving the proof. Imagine maintaining a queue of 1 
and 2-clauses. A 3-clause enters the queue when one of its literals is set to false and it 
becomes a 2-clause. SC always picks a 1 or 2-clause if there is one and sets one of its 
literals to true. At any step when the total number of 1 and 2-clauses is positive, one of 
the clauses is removed from the queue. Consider the arrival rate, that is, the expected 
number of arrivals into the queue at a given time t. For a particular clause to arrive into 
the queue at time t to become a 2-clause, it must contain the negation of the literal being 
set to true at time $t$. It can contain any two other literals not yet set. The number of
such clauses is $\binom{n}{2} 2^2$. So, the probability that a particular clause arrives in the queue at
time $t$ is at most

$$\frac{\binom{n}{2} 2^2}{\binom{n}{3} 2^3} = \frac{3}{2(n-2)}.$$

Since there are $cn$ clauses in total, the arrival rate is $\frac{3c}{2}$, which for $c < 2/3$ is a constant
strictly less than one. The arrivals into the queue of different clauses occur independently
(Lemma 8.20), the queue has arrival rate strictly less than one, and the queue loses one or
more clauses whenever it is non-empty. This implies that the queue never has too many
clauses in it. A slightly more complicated argument will show that no clause remains as
a 1 or 2-clause for $\omega(\ln n)$ steps (Lemma 8.21). This implies that the probability of two
contradictory 1-length clauses, which is a precursor to a 0-length clause, is very small.





Lemma 8.20 Let $T_i$ be the first time that clause $i$ turns into a 2-clause. $T_i$ is $\infty$ if clause
$i$ gets satisfied before turning into a 2-clause. The $T_i$ are mutually independent over the
randomness in constructing the formula and the randomness in SC, and for any $t$,

$$\text{Prob}(T_i = t) \le \frac{3}{2(n-2)}.$$


Proof: For the proof, generate the clauses in a different way. The important thing is 
that the new method of generation, called the method of “deferred decisions”, results in 
the same distribution of input formulae as the original. The method of deferred decisions 
is tied in with the SC algorithm and works as follows. At any time, the length of each 
clause (number of literals) is all that we know; we have not yet picked which literals are 
in each clause. At the start, every clause has length three and SC picks one of the clauses 
uniformly at random. Now, SC wants to pick one of the three literals in that clause to 
set to true, but we do not know which literals are in the clause. At this point, we pick 
uniformly at random one of the $2n$ possible literals. Say for illustration, we picked $\bar{x}_{102}$.
The literal $\bar{x}_{102}$ is placed in the clause and set to true. The literal $x_{102}$ is set to false. We
must also deal with occurrences of the literal or its negation in all other clauses, but again,
we do not know which clauses have such an occurrence. We decide that now. For each
clause, independently, with probability $3/n$ include either the literal $\bar{x}_{102}$ or its negation
$x_{102}$, each with probability $1/2$. In the case that we included $\bar{x}_{102}$ (the literal we had set
to true), the clause is now deleted, and if we included $x_{102}$ (the literal we had set to false),
we decrease the residual length of the clause by one.


At a general stage, suppose the fates of $i$ variables have already been decided and
$n - i$ remain. The residual length of each clause is known. Among the clauses that are
not yet satisfied, choose a random shortest length clause. Among the $n - i$ variables
remaining, pick one uniformly at random, then pick it or its negation as the new literal.
Include this literal in the clause thereby satisfying it. Since the clause is satisfied, the
algorithm deletes it. For each other clause, do the following. If its residual length is
$l$, decide with probability $l/(n - i)$ to include the new variable in the clause and if so
with probability $1/2$ each, include it or its negation. If the literal that was set to true is
included in a clause, delete the clause as it is now satisfied. If its negation is included
in a clause, then just delete the literal and decrease the residual length of the clause by one.


Why does this yield the same distribution as the original one? First, observe that the 
order in which the variables are picked by the method of deferred decisions is independent 
of the clauses; it is just a random permutation of the n variables. Look at any one clause. 
For a clause, we decide in order whether each variable or its negation is in the clause. So 
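for a particular clause and a particular triple $i$, $j$, and $k$ with $i < j < k$, the probability
that the clause contains the $i^{th}$, the $j^{th}$, and the $k^{th}$ literal (or their negations) in the order
determined by deferred decisions is:

$$\begin{aligned}
&\left(1 - \frac{3}{n}\right)\left(1 - \frac{3}{n-1}\right)\cdots\left(1 - \frac{3}{n-i+2}\right)\frac{3}{n-i+1} \\
&\quad\times\left(1 - \frac{2}{n-i}\right)\left(1 - \frac{2}{n-i-1}\right)\cdots\left(1 - \frac{2}{n-j+2}\right)\frac{2}{n-j+1} \\
&\quad\times\left(1 - \frac{1}{n-j}\right)\left(1 - \frac{1}{n-j-1}\right)\cdots\left(1 - \frac{1}{n-k+2}\right)\frac{1}{n-k+1} = \frac{6}{n(n-1)(n-2)} = \frac{1}{\binom{n}{3}},
\end{aligned}$$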


where the (1 — ---) factors are for not picking the current variable or negation to be 
included and the others are for including the current variable or its negation. Inde- 
pendence among clauses follows from the fact that we have never let the occurrence or 
non-occurrence of any variable in any clause influence our decisions on other clauses. 
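
The telescoping product can be evaluated exactly with rational arithmetic. The sketch below is our check, assuming the inclusion rule stated earlier (a clause of residual length $l$ receives the current variable with probability $l/(n - i)$ when $i$ variables have already been decided). Under that rule the product works out to $1/\binom{n}{3}$ for every triple, which is the probability of a given variable set under uniform choice; the signs then contribute a further factor of $1/2^3$. The function name is ours.

```python
from fractions import Fraction
from math import comb


def inclusion_probability(n, i, j, k):
    """Probability that a clause receives exactly the variables decided at steps i < j < k."""
    prob, residual = Fraction(1), 3
    for step in range(1, n + 1):
        if residual == 0:
            break
        # step - 1 variables are already decided, so n - step + 1 remain.
        include = Fraction(residual, n - step + 1)
        if step in (i, j, k):
            prob *= include
            residual -= 1
        else:
            prob *= 1 - include
    return prob


example = inclusion_probability(10, 2, 5, 9)  # equals 1 / C(10, 3)
```

The value does not depend on which triple is chosen, which is the uniformity the argument needs.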














Now, we prove the lemma by appealing to the method of deferred decisions to generate
the formula. $T_i = t$ if and only if the method of deferred decisions does not put the current
literal at steps $1, 2, \ldots, t-1$ into the $i^{th}$ clause, but puts the negation of the literal at
step $t$ into it. Thus, the probability is precisely

$$\frac{1}{2}\left(1 - \frac{3}{n}\right)\left(1 - \frac{3}{n-1}\right)\cdots\left(1 - \frac{3}{n-t+2}\right)\frac{3}{n-t+1} \le \frac{3}{2(n-2)},$$

as claimed. Clearly the $T_i$ are independent since again deferred decisions deal with different
clauses independently. ■








Lemma 8.21 There exists a constant $c_2$ such that with probability $1 - o(1)$, no clause
remains a 2 or 1-clause for more than $c_2 \ln n$ steps. I.e., once a 3-clause becomes a
2-clause, it is either satisfied or reduced to a 0-clause in $O(\ln n)$ steps.


Proof: Say that $t$ is a “busy time” if there exists at least one 2-clause or 1-clause at time
$t$, and define a time-window $[r + 1, s]$ to be a “busy window” if time $r$ is not busy but
then each $t \in [r + 1, s]$ is a busy time. We will prove that for some constant $c_2$, with
probability $1 - o(1)$, all busy windows have length at most $c_2 \ln n$.

Fix some $r$ and $s$ and consider the event that $[r + 1, s]$ is a busy window. Since SC always
decreases the total number of 1 and 2-clauses by one whenever it is positive, we must have
generated at least $s - r$ new 2-clauses between $r$ and $s$. Now, define an indicator variable
for each 3-clause which has value one if the clause turns into a 2-clause between $r$ and
$s$. By Lemma 8.20 these variables are independent and the probability that a particular
3-clause turns into a 2-clause at a time $t$ is at most $3/(2(n - 2))$. Summing over $t$ between
$r$ and $s$,

$$\text{Prob}(\text{a 3-clause turns into a 2-clause during } [r, s]) \le \frac{3(s - r)}{2(n - 2)}.$$

Since there are $cn$ clauses in all, the expected sum of the indicator variables is
$cn \frac{3(s - r)}{2(n - 2)} \approx \frac{3c(s - r)}{2}$. Note that $3c/2 < 1$, which implies the arrival rate into the queue
of 2 and 1-clauses is a constant strictly less than one. Using Chernoff bounds, if $s - r \ge c_2 \ln n$ for
appropriate constant $c_2$, the probability that more than $s - r$ clauses turn into 2-clauses
between $r$ and $s$ is at most $1/n^3$. Applying the union bound over all $O(n^2)$ possible choices
of $r$ and $s$, we get that the probability that any clause remains a 2 or 1-clause for more
than $c_2 \ln n$ steps is $o(1)$. ■


Now, assume the $1 - o(1)$ probability event of Lemma 8.21 that no clause remains a
2 or 1-clause for more than $c_2 \ln n$ steps. We will show that this implies it is unlikely the
SC algorithm terminates in failure.


Suppose SC terminates in failure. This means that at some time $t$, the algorithm
generates a 0-clause. At time $t - 1$, this clause must have been a 1-clause. Suppose the
clause consists of the literal $w$. Since at time $t - 1$, there is at least one 1-clause, the
shortest clause rule of SC selects a 1-clause and sets the literal in that clause to true.
This other clause must have been $\bar{w}$. Let $t_1$ be the first time either of these two clauses,
$w$ or $\bar{w}$, became a 2-clause. We have $t - t_1 \le c_2 \ln n$. Clearly, until time $t$, neither of these
two clauses is picked by SC. So, the literals which are set to true during this period are
chosen independent of these clauses. Say the two clauses were $w + x + y$ and $\bar{w} + u + v$
at the start. $x$, $y$, $u$, and $v$ must all be negations of literals set to true during steps $t_1$ to
$t$. So, there are only $O\left((\ln n)^4\right)$ choices for $x$, $y$, $u$, and $v$ for a given value of $t$. There are
$O(n)$ choices of $w$, $O(n^2)$ choices of which two clauses $i$ and $j$ of the input become these
$w$ and $\bar{w}$, and $n$ choices for $t$. Thus, there are $O\left(n^4 (\ln n)^4\right)$ choices for what these clauses
contain and which clauses they are in the input. On the other hand, for any given $i$ and $j$,
the probability that clauses $i$ and $j$ both match a given set of literals is $O(1/n^6)$. Thus the
probability that these choices are actually realized is $O\left(n^4 (\ln n)^4 / n^6\right) = o(1)$,
as required.


8.8 Non-uniform Models of Random Graphs 


So far we have considered the G(n, p) random graph model in which all vertices have 
the same expected degree, and moreover degrees are concentrated close to their expecta- 
tion. However, large graphs occurring in the real world tend to have power law degree 
distributions. For a power law degree distribution, the number $f(d)$ of vertices of degree
$d$ scales as $1/d^{\alpha}$ for some constant $\alpha > 0$.




Consider a graph in which half of the vertices are degree one and half are degree two.
If a vertex is selected at random, it is equally likely to be degree one or degree two.
However, if we select an edge at random and walk to a random endpoint, the vertex is
twice as likely to be degree two as degree one. In many graph algorithms, a vertex is
reached by randomly selecting an edge and traversing the edge to reach an endpoint. In
this case, the probability of reaching a degree $i$ vertex is proportional to $i\lambda_i$ where $\lambda_i$
is the fraction of vertices that are degree $i$.

Figure 8.13: Probability of encountering a degree d vertex when following a path in a
graph.

One way to generate such graphs is to stipulate that there are $f(d)$ vertices of degree
$d$ and choose uniformly at random from the set of graphs with this degree distribution.
Clearly, in this model the graph edges are not independent and this makes these random
graphs harder to analyze. But the question of when phase transitions occur in random
graphs with arbitrary degree distributions is still of interest. In this section, we consider
when a random graph with a non-uniform degree distribution has a giant component. Our
treatment in this section, and subsequent ones, will be more intuitive without providing
rigorous proofs.


8.8.1 Giant Component in Graphs with Given Degree Distribution 


Molloy and Reed address the issue of when a random graph with a non-uniform degree
distribution has a giant component. Let $\lambda_i$ be the fraction of vertices of degree $i$. There
will be a giant component if and only if $\sum_{i=0}^{\infty} i(i - 2)\lambda_i > 0$.

To see intuitively that this is the correct formula, consider exploring a component 
of a graph starting from a given seed vertex. Degree zero vertices do not occur except 
in the case where the vertex is the seed. If a degree one vertex is encountered, then 
that terminates the expansion along the edge into the vertex. Thus, we do not want to 
encounter too many degree one vertices. A degree two vertex is neutral in that the vertex 
is entered by one edge and left by the other. There is no net increase in the size of the 
frontier. Vertices of degree i greater than two increase the frontier by i — 2 vertices. The 
vertex is entered by one of its edges and thus there are i — 1 edges to new vertices in the 
frontier for a net gain of $i - 2$. The $i\lambda_i$ in $i(i - 2)\lambda_i$ is proportional to the probability of
reaching a degree $i$ vertex and the $i - 2$ accounts for the increase or decrease in size of
the frontier when a degree $i$ vertex is reached.
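
The criterion is a one-line computation once the degree fractions are known. The sketch below is our illustration; the three degree distributions are hypothetical examples, and the function names are ours.

```python
def molloy_reed_sum(lam):
    """lam[i] is the fraction of vertices of degree i; returns sum_i i*(i-2)*lam[i]."""
    return sum(i * (i - 2) * frac for i, frac in enumerate(lam))


def has_giant_component(lam):
    """Molloy-Reed criterion: a giant component exists iff the sum is positive."""
    return molloy_reed_sum(lam) > 0


half_one_half_three = [0.0, 0.5, 0.0, 0.5]  # degrees 1 and 3, half each
half_one_half_two = [0.0, 0.5, 0.5]         # degrees 1 and 2, half each
all_degree_two = [0.0, 0.0, 1.0]            # every vertex has degree 2
```

In the first case the degree three vertices outweigh the degree one vertices (sum $+1$); in the second the degree two vertices are neutral and the degree one vertices make the sum negative; the third sits exactly on the boundary with sum zero.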


Example: Consider applying the Molloy Reed conditions to the $G(n, p)$ model, and use
$p_i$ to denote the probability that a vertex has degree $i$, i.e., in analog to $\lambda_i$. It turns out
that the summation $\sum_{i=0}^{n} i(i - 2)p_i$ gives value zero precisely when $p = 1/n$, the point
at which the phase transition occurs. At $p = 1/n$, the average degree of each vertex is
one and there are $n/2$ edges. However, the actual degree distribution of the vertices is
binomial, where the probability that a vertex is of degree $i$ is given by $p_i = \binom{n}{i} p^i (1 - p)^{n-i}$.


We now show that $\lim_{n \to \infty} \sum_{i=0}^{n} i(i - 2)p_i = 0$ for $p_i = \binom{n}{i} p^i (1 - p)^{n-i}$ when $p = 1/n$.
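
The limit can be checked numerically. The sketch below is our illustration and evaluates the sum directly from the binomial probabilities; it also relies on the standard binomial moments, from which the sum equals $np\,((n-1)p - 1)$, so that at $p = 1/n$ it is $-1/n$ and tends to zero as $n$ grows. The function name is ours.

```python
from math import comb


def molloy_reed_gnp(n, p):
    """sum_i i*(i-2)*C(n,i)*p^i*(1-p)^(n-i) for the binomial degree distribution."""
    return sum(
        i * (i - 2) * comb(n, i) * p ** i * (1 - p) ** (n - i)
        for i in range(n + 1)
    )


n = 50
at_transition = molloy_reed_gnp(n, 1 / n)    # close to -1/n
below = molloy_reed_gnp(n, 0.5 / n)          # negative: no giant component
above = molloy_reed_gnp(n, 2 / n)            # positive: giant component
```

The sign change around $p = 1/n$ matches the location of the phase transition in $G(n, p)$.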


8.9 Growth Models


Many graphs that arise in the outside world started as small graphs that grew over 
time. In a model for such graphs, vertices and edges are added to the graph over time. 
In such a model there are many ways in which to select the vertices for attaching a new 
edge. One is to select two vertices uniformly at random from the set of existing vertices. 
Another is to select two vertices with probability proportional to their degree. This latter 
method is referred to as preferential attachment. A variant of this method would be to 
add a new vertex at each unit of time and with probability $\delta$ add an edge where one
end of the edge is the new vertex and the other end is a vertex selected with probability
proportional to its degree. The graph generated by this latter method is a tree with a
power law degree distribution.


8.9.1 Growth Model Without Preferential Attachment 


Consider a growth model for a random graph without preferential attachment. Start
with zero vertices. At each unit of time a new vertex is created and with probability $\delta$
two vertices chosen at random are joined by an edge. The two vertices may already have
an edge between them. In this case, we add another edge. So, the resulting structure is a
multi-graph, rather than a graph. Since at time $t$, there are $t$ vertices and in expectation
only $O(\delta t)$ edges where there are $t^2$ pairs of vertices, it is very unlikely that there will be
many multiple edges.


The degree distribution for this growth model is calculated as follows. The number of
vertices of degree $k$ at time $t$ is a random variable. Let $d_k(t)$ be the expectation of the
number of vertices of degree $k$ at time $t$. The number of isolated vertices increases by one
at each unit of time and decreases by the number of isolated vertices, $b(t)$, that are picked
to be end points of the new edge. $b(t)$ can take on values 0, 1, or 2. Taking expectations,


$$d_0(t + 1) = d_0(t) + 1 - E(b(t)).$$


Now $b(t)$ is the sum of two 0-1 valued random variables whose values are the number
of degree zero vertices picked for each end point of the new edge. Even though the
two random variables are not independent, the expectation of $b(t)$ is the sum of the
expectations of the two variables and is $2\delta \frac{d_0(t)}{t}$. Thus,

$$d_0(t + 1) = d_0(t) + 1 - 2\delta \frac{d_0(t)}{t}.$$

The number of degree $k$ vertices increases whenever a new edge is added to a degree $k - 1$
vertex and decreases when a new edge is added to a degree $k$ vertex. Reasoning as above,

$$d_k(t + 1) = d_k(t) + 2\delta \frac{d_{k-1}(t)}{t} - 2\delta \frac{d_k(t)}{t}. \qquad (8.4)$$

Note that this formula, as others in this section, is not quite precise. For example, the
same vertex may be picked twice, so that the new edge is a self-loop. For $k \ll t$, this
problem contributes a minuscule error. Restricting $k$ to be a fixed constant and letting
$t \to \infty$ in this section avoids these problems.


Assume that the above equations are exactly valid. Clearly, $d_0(1) = 1$ and $d_1(1) =
d_2(1) = \cdots = 0$. By induction on $t$, there is a unique solution to (8.4), since given $d_k(t)$
for all $k$, the equation determines $d_k(t + 1)$ for all $k$. There is a solution of the form
$d_k(t) = p_k t$, where $p_k$ depends only on $k$ and not on $t$, provided $k$ is fixed and $t \to \infty$.
Again, this is not precisely true since $d_1(1) = 0$ and $d_1(2) > 0$ clearly contradict the
existence of a solution of the form $d_1(t) = p_1 t$.


Set $d_k(t) = p_k t$. Then,

$$(t+1)\, p_0 = p_0 t + 1 - 2\delta \frac{p_0 t}{t}$$

$$p_0 = 1 - 2\delta p_0$$

$$p_0 = \frac{1}{1+2\delta}$$


and 





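$$(t+1)\, p_k = p_k t + 2\delta \frac{p_{k-1} t}{t} - 2\delta \frac{p_k t}{t}$$

$$p_k = 2\delta p_{k-1} - 2\delta p_k$$

$$p_k = \frac{2\delta}{1+2\delta}\, p_{k-1} = \left(\frac{2\delta}{1+2\delta}\right)^k p_0 = \frac{1}{1+2\delta} \left(\frac{2\delta}{1+2\delta}\right)^k. \qquad (8.5)$$

Thus, the model gives rise to a graph with a degree distribution that falls off exponentially
fast with the degree.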
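
As a numerical sanity check, recurrence (8.4) can be iterated directly and the ratios $d_k(t)/t$ compared with the limiting values $p_k = \frac{1}{1+2\delta}\left(\frac{2\delta}{1+2\delta}\right)^k$ that follow from substituting $d_k(t) = p_k t$. The sketch below is illustrative only: the function names, the value $\delta = 0.25$, and the horizon $t = 20000$ are assumptions made for the example, not part of the text.

```python
def expected_degree_counts(delta, t_max, k_max):
    """Iterate recurrence (8.4) from t = 1 and return [d_0(t_max), ..., d_kmax(t_max)]."""
    d = [0.0] * (k_max + 1)
    d[0] = 1.0  # at time 1 there is a single isolated vertex
    for t in range(1, t_max):
        new = d[:]
        new[0] = d[0] + 1 - 2 * delta * d[0] / t
        for k in range(1, k_max + 1):
            new[k] = d[k] + 2 * delta * d[k - 1] / t - 2 * delta * d[k] / t
        d = new
    return d


def limiting_degree_fraction(delta, k):
    """Closed form p_k obtained by solving the recurrence for p_k."""
    ratio = 2 * delta / (1 + 2 * delta)
    return ratio ** k / (1 + 2 * delta)


if __name__ == "__main__":
    delta, t_max = 0.25, 20000  # illustrative values
    d = expected_degree_counts(delta, t_max, k_max=5)
    for k, dk in enumerate(d):
        print(k, dk / t_max, limiting_degree_fraction(delta, k))
```

The two printed columns should agree to several decimal places, and the geometric ratio $\frac{2\delta}{1+2\delta}$ between successive values shows the exponential fall-off.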





The generating function for component size 


Let $n_k(t)$ be the expected number of components of size $k$ at time $t$. Then $n_k(t)$ is
proportional to the probability that a randomly picked component is of size $k$. This is
not the same as picking the component containing a randomly selected vertex (see Figure
8.14). Indeed, the probability that the size of the component containing a randomly
selected vertex is $k$ is proportional to $k n_k(t)$. We will show that there is a solution for $n_k(t)$
of the form $a_k t$ where $a_k$ is a constant independent of $t$. After showing this, we focus on
the generating function $g(x)$ for the numbers $k a_k$ and use $g(x)$ to find the threshold
for giant components.


Consider $n_1(t)$, the expected number of isolated vertices at time $t$. At each unit of
time, an isolated vertex is added to the graph and an expected $\frac{2\delta n_1(t)}{t}$ many isolated
vertices are chosen for attachment and thereby leave the set of isolated vertices. Thus,

$$n_1(t+1) = n_1(t) + 1 - 2\delta \frac{n_1(t)}{t}.$$

For $k > 1$, $n_k(t)$ increases when two smaller components whose sizes sum to $k$ are joined
by an edge and decreases when a vertex in a component of size $k$ is chosen for attachment.
The probability that a vertex selected at random will be in a size $k$ component is $\frac{k n_k(t)}{t}$.
Thus,

$$n_k(t+1) = n_k(t) + \delta \sum_{j=1}^{k-1} \frac{j\, n_j(t)}{t}\, \frac{(k-j)\, n_{k-j}(t)}{t} - 2\delta \frac{k\, n_k(t)}{t}.$$

Figure 8.14: In selecting a component at random, each of the two components is equally
likely to be selected. In selecting the component containing a random vertex, the larger
component is twice as likely to be selected.


To be precise, one needs to consider the actual number of components of various sizes,
rather than the expected numbers. Also, if both vertices at the end of the edge are in the
same $k$-vertex component, then $n_k(t)$ does not go down as claimed. These small
inaccuracies can be ignored.


Consider solutions of the form $n_k(t) = a_k t$. Note that $n_k(t) = a_k t$ implies the number
of vertices in a connected component of size $k$ is $k a_k t$. Since the total number of
vertices at time $t$ is $t$, $k a_k$ is the probability that a random vertex is in a connected
component of size $k$. The recurrences here are valid only for $k$ fixed as $t \to \infty$. So
$\sum_{k=1}^{\infty} k a_k$ may be less than 1, in which case, there are non-finite size components whose
sizes are growing with $t$. Solving for $a_k$ yields

$$a_1 = \frac{1}{1+2\delta} \quad \text{and} \quad a_k = \frac{\delta}{1+2k\delta} \sum_{j=1}^{k-1} j(k-j)\, a_j a_{k-j}.$$


Consider the generating function $g(x)$ for the distribution of component sizes where
the coefficient of $x^k$ is the probability that a vertex chosen at random is in a component
of size $k$.

$$g(x) = \sum_{k=1}^{\infty} k a_k x^k.$$

Now, $g(1) = \sum_{k=1}^{\infty} k a_k$ is the probability that a randomly chosen vertex is in a finite sized
component. For $\delta = 0$, this is clearly one, since all vertices are in components of size
one. On the other hand, for $\delta = 1$, the vertex created at time one has expected degree
$\log n$ (since its expected degree increases by $2/t$ and $\sum_{t=1}^{n} (2/t) = \Theta(\log n)$); so, it is in a
non-finite size component. This implies that for $\delta = 1$, $g(1) < 1$ and there is a non-finite
size component. Assuming continuity, there is a $\delta_{critical}$ above which $g(1) < 1$. From the
formula for the $a_i$'s, we will derive the differential equation

$$g = -2\delta x g' + 2\delta x g g' + x$$

and then use the equation for $g$ to determine the value of $\delta$ at which the phase transition
for the appearance of a non-finite sized component occurs.
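
The coefficients $a_k$ and the differential equation can be checked numerically. The sketch below computes $a_k$ from the recurrence $a_k(1+2k\delta) = \delta \sum_{j=1}^{k-1} j(k-j) a_j a_{k-j}$, forms partial sums of $\sum_k k a_k$, and evaluates the residual of $g = -2\delta x g' + 2\delta x g g' + x$ at a point inside the radius of convergence. The function names and the truncation level `k_max` are assumptions of this example, and a truncated sum only approximates $g(1)$ from below.

```python
def component_coefficients(delta, k_max):
    """Return a list a with a[k] = a_k for 1 <= k <= k_max (a[0] is unused)."""
    a = [0.0] * (k_max + 1)
    a[1] = 1.0 / (1.0 + 2.0 * delta)
    for k in range(2, k_max + 1):
        conv = sum(j * (k - j) * a[j] * a[k - j] for j in range(1, k))
        a[k] = delta * conv / (1.0 + 2.0 * k * delta)
    return a


def finite_fraction(delta, k_max):
    """Partial sum of sum_k k*a_k, a lower estimate of g(1)."""
    a = component_coefficients(delta, k_max)
    return sum(k * a[k] for k in range(1, k_max + 1))


def ode_residual(delta, x, k_max=200):
    """g - (-2*delta*x*g' + 2*delta*x*g*g' + x), using the truncated series for g and g'."""
    a = component_coefficients(delta, k_max)
    g = sum(k * a[k] * x ** k for k in range(1, k_max + 1))
    g_prime = sum(k * k * a[k] * x ** (k - 1) for k in range(1, k_max + 1))
    return g - (-2 * delta * x * g_prime + 2 * delta * x * g * g_prime + x)


if __name__ == "__main__":
    for delta in (0.05, 1.0):  # one small and one large edge probability
        print(delta, finite_fraction(delta, 200), ode_residual(delta, 0.5))
```

For small $\delta$ the partial sum is close to one, while for $\delta = 1$ it stays well below one, consistent with a non-finite component absorbing the remaining vertices.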
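Derivation of g(x)

From

$$a_1 = \frac{1}{1+2\delta}$$

and

$$a_k = \frac{\delta}{1+2k\delta} \sum_{j=1}^{k-1} j(k-j)\, a_j a_{k-j}$$

derive the equations

$$a_1 (1+2\delta) - 1 = 0$$

and

$$a_k (1+2k\delta) = \delta \sum_{j=1}^{k-1} j(k-j)\, a_j a_{k-j}$$

for $k \ge 2$. The generating function is formed by multiplying the $k^{th}$ equation by $k x^k$ and
summing over all $k$. This gives

$$-x + \sum_{k=1}^{\infty} k a_k x^k + 2\delta x \sum_{k=1}^{\infty} a_k k^2 x^{k-1} = \delta \sum_{k=1}^{\infty} k x^k \sum_{j=1}^{k-1} j(k-j)\, a_j a_{k-j}.$$

Note that

$$g(x) = \sum_{k=1}^{\infty} k a_k x^k \quad \text{and} \quad g'(x) = \sum_{k=1}^{\infty} a_k k^2 x^{k-1}.$$

Thus,

$$-x + g(x) + 2\delta x g'(x) = \delta \sum_{k=1}^{\infty} k x^k \sum_{j=1}^{k-1} j(k-j)\, a_j a_{k-j}.$$

Working with the right hand side

$$\delta \sum_{k=1}^{\infty} k x^k \sum_{j=1}^{k-1} j(k-j)\, a_j a_{k-j} = \delta x \sum_{k=1}^{\infty} \sum_{j=1}^{k-1} j(k-j)(j + k - j)\, x^{k-1} a_j a_{k-j}.$$

Now breaking the $j + k - j$ into two sums gives

$$\delta x \sum_{k=1}^{\infty} \sum_{j=1}^{k-1} j^2 a_j x^{j-1} (k-j)\, a_{k-j} x^{k-j} + \delta x \sum_{k=1}^{\infty} \sum_{j=1}^{k-1} j a_j x^{j} (k-j)^2 a_{k-j} x^{k-j-1}.$$

Notice that the second sum is obtained from the first by substituting $k - j$ for $j$ and that
both terms are $\delta x g' g$. Thus,

$$-x + g(x) + 2\delta x g'(x) = 2\delta x g'(x) g(x).$$

Hence,

$$g' = \frac{1}{2\delta}\, \frac{1 - \frac{g}{x}}{1 - g}.$$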
Phase transition for non-finite components

The generating function $g(x)$ contains information about the finite components of the
graph. A finite component is a component of size 1, 2, ..., which does not depend on $t$.
Observe that $g(1) = \sum_{k=1}^{\infty} k a_k$ and hence $g(1)$ is the probability that a randomly chosen
vertex will belong to a component of finite size. If $g(1) = 1$ there are no infinite
components. When $g(1) \ne 1$, then $1 - g(1)$ is the expected fraction of the vertices that are in
non-finite components. Potentially, there could be many such non-finite components. But
an argument similar to Part 3 of Theorem ?? concludes that two fairly large components
would merge into one. Suppose there are two connected components at time $t$, each of
size at least $t^{4/5}$. Consider the earliest created $\frac{1}{2} t^{4/5}$ vertices in each part. These vertices
must have lived for at least $\frac{1}{2} t^{4/5}$ time after creation. At each time, the probability of an
edge forming between two such vertices, one in each component, is at least $\delta\, \Omega(t^{-2/5})$ and
so the probability that no such edge formed is at most $\left(1 - \delta t^{-2/5}\right)^{t^{4/5}/2} \le e^{-\Omega(\delta t^{2/5})} \to 0$.
So with high probability, such components would have merged into one. But this still
leaves open the possibility of many components of size $t^{\varepsilon}$, $(\ln t)^2$, or some other slowly
growing function of $t$.


We now calculate the value of $\delta$ at which the phase transition for a non-finite
component occurs. Recall that the generating function $g(x)$ satisfies

$$g'(x) = \frac{1}{2\delta}\, \frac{1 - \frac{g(x)}{x}}{1 - g(x)}.$$

If $\delta$ is greater than some $\delta_{critical}$, then $g(1) \ne 1$. In this case the above formula at $x = 1$
simplifies with $1 - g(1)$ canceling from the numerator and denominator, leaving just $\frac{1}{2\delta}$.
Since $k a_k$ is the probability that a randomly chosen vertex is in a component of size $k$,
the average size of the finite components is $g'(1) = \sum_{k=1}^{\infty} k^2 a_k$. Now, $g'(1)$ is given by

$$g'(1) = \frac{1}{2\delta} \qquad (8.6)$$

for all $\delta$ greater than $\delta_{critical}$. If $\delta$ is less than $\delta_{critical}$, then all vertices are in finite
components. In this case $g(1) = 1$ and both the numerator and the denominator approach zero.
Applying L'Hôpital's rule

$$\lim_{x \to 1} g'(x) = \frac{1}{2\delta} \left. \frac{\frac{x g'(x) - g(x)}{x^2}}{g'(x)} \right|_{x=1}$$

or

$$(g'(1))^2 = \frac{1}{2\delta} \left( g'(1) - g(1) \right).$$

With $g(1) = 1$, the quadratic $(g'(1))^2 - \frac{1}{2\delta} g'(1) + \frac{1}{2\delta} g(1) = 0$ has solutions

$$g'(1) = \frac{\frac{1}{2\delta} \pm \sqrt{\frac{1}{4\delta^2} - \frac{4}{2\delta}}}{2} = \frac{1 \pm \sqrt{1 - 8\delta}}{4\delta}. \qquad (8.7)$$


The two solutions given by (8.7) become complex for $\delta > 1/8$ and thus can be valid only
for $0 \le \delta \le 1/8$. For $\delta > 1/8$, the only solution is $g'(1) = \frac{1}{2\delta}$ and an infinite component
exists. As $\delta$ is decreased, at $\delta = 1/8$ there is a singular point where for $\delta < 1/8$ there are
three possible solutions, one from (8.6) which implies a giant component and two from
(8.7) which imply no giant component. To determine which one of the three solutions is
valid, consider the limit as $\delta \to 0$. In the limit all components are of size one since there
are no edges. Only (8.7) with the minus sign gives the correct solution

$$g'(1) = \frac{1 - \sqrt{1 - 8\delta}}{4\delta} = \frac{1 - \left(1 - \frac{1}{2} 8\delta - \frac{1}{8} 64\delta^2 - \cdots\right)}{4\delta} = 1 + 2\delta + \cdots \to 1.$$

In the absence of any non-analytic behavior in the equation for $g'(x)$ in the region
$0 \le \delta < 1/8$, we conclude that (8.7) with the minus sign is the correct solution for
$0 \le \delta < 1/8$ and hence the critical value of $\delta$ for the phase transition is 1/8. As we shall
see, this is different from the static case.


As the value of $\delta$ is increased, the average size of the finite components increases from
one to

$$\left. \frac{1 - \sqrt{1 - 8\delta}}{4\delta} \right|_{\delta = 1/8} = 2$$

when $\delta$ reaches the critical value of 1/8. At $\delta = 1/8$, the average size of the finite
components jumps to $\left. \frac{1}{2\delta} \right|_{\delta = 1/8} = 4$ and then decreases as $\frac{1}{2\delta}$ as the giant component swallows
up the finite components starting with the larger components.
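
The jump at the critical value can be made concrete with a small sketch that evaluates the two branches, (8.7) with the minus sign below $\delta = 1/8$ and (8.6) above it. The function name is hypothetical, and assigning the lower branch to the point $\delta = 1/8$ itself is a convention chosen for this example.

```python
import math


def average_finite_component_size(delta):
    """g'(1): branch (8.7) with the minus sign for delta <= 1/8, branch (8.6) above."""
    if delta == 0:
        return 1.0  # no edges, so every component is a single vertex
    if delta <= 1 / 8:
        return (1 - math.sqrt(1 - 8 * delta)) / (4 * delta)
    return 1 / (2 * delta)


if __name__ == "__main__":
    for delta in (0.0, 0.05, 0.125, 0.126, 0.25, 0.5):
        print(delta, average_finite_component_size(delta))
```

The printed values rise from 1 toward 2 as $\delta$ approaches 1/8, jump to roughly 4 just above it, and then decay like $\frac{1}{2\delta}$.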


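Comparison to static random graph

Consider a static random graph with the same degree distribution as the graph in the
growth model. Again let $p_k$ be the probability of a vertex being of degree $k$. From (8.5)

$$p_k = \frac{(2\delta)^k}{(1+2\delta)^{k+1}}.$$

Recall the Molloy Reed analysis of random graphs with given degree distributions which
asserts that there is a phase transition at $\sum_{i=0}^{\infty} i(i-2)\, p_i = 0$. Using this, it is easy to see
that a phase transition occurs for $\delta = 1/4$. For $\delta = 1/4$,

$$p_k = \frac{(2\delta)^k}{(1+2\delta)^{k+1}} = \frac{\left(\frac{1}{2}\right)^k}{\left(1+\frac{1}{2}\right)^{k+1}} = \frac{\left(\frac{1}{2}\right)^k}{\frac{3}{2}\left(\frac{3}{2}\right)^k} = \frac{2}{3}\left(\frac{1}{3}\right)^k$$

and

$$\sum_{i=0}^{\infty} i(i-2)\, \frac{2}{3}\left(\frac{1}{3}\right)^i = \frac{2}{3} \sum_{i=0}^{\infty} i^2 \left(\frac{1}{3}\right)^i - \frac{4}{3} \sum_{i=0}^{\infty} i \left(\frac{1}{3}\right)^i = \frac{2}{3} \times \frac{3}{2} - \frac{4}{3} \times \frac{3}{4} = 0.$$

Recall that $1 + a + a^2 + \cdots = \frac{1}{1-a}$, $a + 2a^2 + 3a^3 + \cdots = \frac{a}{(1-a)^2}$, and $a + 4a^2 + 9a^3 + \cdots = \frac{a(1+a)}{(1-a)^3}$.

Figure 8.15: Comparison of the static random graph model and the growth model. The
curve for the growth model is obtained by integrating $g'$.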
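
The Molloy Reed criterion for this degree distribution can also be checked numerically. The sketch below sums $i(i-2)p_i$ for $p_i = \frac{1}{1+2\delta}\left(\frac{2\delta}{1+2\delta}\right)^i$ and shows the sign change at $\delta = 1/4$. The function names and the truncation at 200 terms are assumptions of the example; the terms decay geometrically, so the truncation error is negligible for the values of $\delta$ used here.

```python
def degree_probability(delta, k):
    """p_k = (2*delta)^k / (1 + 2*delta)^(k+1), written with the ratio to avoid overflow."""
    ratio = 2 * delta / (1 + 2 * delta)
    return ratio ** k / (1 + 2 * delta)


def molloy_reed_sum(delta, terms=200):
    """Truncated value of sum_i i*(i-2)*p_i; negative below the transition, positive above."""
    return sum(i * (i - 2) * degree_probability(delta, i) for i in range(terms))


if __name__ == "__main__":
    for delta in (0.20, 0.25, 0.30):
        print(delta, molloy_reed_sum(delta))
```

With $a = \frac{2\delta}{1+2\delta}$ the sum has the closed form $\frac{a(3a-1)}{(1-a)^2}$, which vanishes exactly when $a = 1/3$, that is, when $\delta = 1/4$.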


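See references at end of the chapter for calculating the fractional size $S_{static}$ of the
giant component in the static graph. The result is

$$S_{static} = \begin{cases} 0 & \delta \le \frac{1}{4} \\ 1 - \frac{1}{\delta + \sqrt{\delta^2 + 2\delta}} & \delta > \frac{1}{4} \end{cases}$$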
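
A short sketch of the static-graph result, assuming the piecewise formula for $S_{static}$ quoted above (zero up to $\delta = 1/4$ and $1 - \frac{1}{\delta + \sqrt{\delta^2 + 2\delta}}$ beyond it). The function name is hypothetical. It illustrates that the giant component of the static graph grows continuously from zero at $\delta = 1/4$, in contrast with the threshold $\delta = 1/8$ of the growth model.

```python
import math


def static_giant_fraction(delta):
    """Fractional size of the giant component in the static graph, per the quoted formula."""
    if delta <= 0.25:
        return 0.0
    return 1 - 1 / (delta + math.sqrt(delta * delta + 2 * delta))


if __name__ == "__main__":
    for delta in (0.125, 0.25, 0.3, 0.5, 1.0):
        print(delta, static_giant_fraction(delta))
```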


8.9.2 Growth Model With Preferential Attachment 


Consider a growth model with preferential attachment. At each time unit, a vertex is
added to the graph. Then with probability $\delta$, an edge is attached to the new vertex and
to a vertex selected at random with probability proportional to its degree. This model
generates a tree with a power law distribution.

Let $d_i(t)$ be the expected degree of the $i^{th}$ vertex at time $t$. The sum of the expected
degrees of all vertices at time $t$ is $2\delta t$ and thus the probability that an edge is connected
to vertex $i$ at time $t$ is $\frac{d_i(t)}{2\delta t}$. The degree of vertex $i$ is governed by the equation

$$\frac{\partial}{\partial t} d_i(t) = \delta\, \frac{d_i(t)}{2\delta t} = \frac{d_i(t)}{2t}$$
where $\delta$ is the probability that an edge is added at time $t$ and $\frac{d_i(t)}{2\delta t}$ is the probability that
the vertex $i$ is selected for the end point of the edge.

Figure 8.16: Illustration of degree of $i^{th}$ vertex at time $t$ (vertical axis: expected
degree; horizontal axis: vertex number). At time $t$, vertices numbered
1 to $\frac{\delta^2}{d^2} t$ have degrees greater than $d$.


The two in the denominator governs the solution, which is of the form $a t^{1/2}$. The value
of $a$ is determined by the initial condition $d_i(t) = \delta$ at $t = i$. Thus, $\delta = a i^{1/2}$ or $a = \delta i^{-1/2}$.
Hence, $d_i(t) = \delta \sqrt{t/i}$.


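Next, we determine the probability distribution of vertex degrees. Now, $d_i(t)$ is less
than $d$ provided $i > \frac{\delta^2}{d^2} t$. The fraction of the $t$ vertices at time $t$ for which $i > \frac{\delta^2}{d^2} t$ and thus
that the degree is less than $d$ is $1 - \frac{\delta^2}{d^2}$. Hence, the probability that a vertex has degree
less than $d$ is $1 - \frac{\delta^2}{d^2}$. The probability density $p(d)$ satisfies

$$\int_0^d p(d)\, \partial d = \text{Prob}(\text{degree} < d) = 1 - \frac{\delta^2}{d^2}$$

and can be obtained from the derivative of $\text{Prob}(\text{degree} < d)$.

$$p(d) = \frac{\partial}{\partial d} \left(1 - \frac{\delta^2}{d^2}\right) = 2\, \frac{\delta^2}{d^3},$$

a power law distribution.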
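
The counting step behind the distribution can be reproduced directly. The sketch below uses the expected degrees $d_i(t) = \delta\sqrt{t/i}$ from the analysis (not a simulation of the random process) and counts the fraction of vertices whose expected degree is below $d$, which should match $1 - \frac{\delta^2}{d^2}$. The function names and the parameter values are assumptions chosen for illustration.

```python
import math


def expected_degree(i, t, delta):
    """Expected degree at time t of the vertex created at time i: delta * sqrt(t / i)."""
    return delta * math.sqrt(t / i)


def fraction_with_degree_below(d, t, delta):
    """Fraction of vertices 1..t whose expected degree is strictly less than d."""
    below = sum(1 for i in range(1, t + 1) if expected_degree(i, t, delta) < d)
    return below / t


if __name__ == "__main__":
    t, delta = 100_000, 1.0  # illustrative values
    for d in (2, 4, 8):
        print(d, fraction_with_degree_below(d, t, delta), 1 - delta ** 2 / d ** 2)
```

Differencing the printed fractions across values of $d$ approximates the density $2\delta^2/d^3$.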


8.10 Small World Graphs 


In the 1960’s, Stanley Milgram carried out an experiment that indicated that most 
pairs of individuals in the United States were connected by a short sequence of acquain- 
tances. Milgram would ask a source individual, say in Nebraska, to start a letter on its 
journey to a target individual in Massachusetts. The Nebraska individual would be given 
basic information about the target including his address and occupation and asked to
send the letter to someone he knew on a first name basis, who was closer to the target 
individual, in order to transmit the letter to the target in as few steps as possible. Each 
person receiving the letter would be given the same instructions. In successful experi- 
ments, it would take on average five to six steps for a letter to reach its target. This 
research generated the phrase “six degrees of separation” along with substantial research 
in social science on the interconnections between people. Surprisingly, there was no work 
on how to find the short paths using only local information. 


In many situations, phenomena are modeled by graphs whose edges can be partitioned
into local and long-distance. We adopt a simple model of a directed graph due to Kleinberg,
having local and long-distance edges. Consider a 2-dimensional $n \times n$ grid where each
vertex is connected to its four adjacent vertices via bidirectional local edges. In addition
to these local edges, there is one long-distance edge out of each vertex. The probability
that the long-distance edge from vertex $u$ terminates at $v$, $v \ne u$, is a function of the
distance $d(u,v)$ from $u$ to $v$. Here distance is measured by the shortest path consisting
only of local grid edges. The probability is proportional to $1/d^r(u,v)$ for some constant $r$.
This gives a one parameter family of random graphs. For $r$ equal zero, $1/d^0(u,v) = 1$ for
all $u$ and $v$ and thus the end of the long-distance edge at $u$ is uniformly distributed over all
vertices independent of distance. As $r$ increases the expected length of the long-distance
edge decreases. As $r$ approaches infinity, there are no long-distance edges and thus no
paths shorter than that of the lattice path. What is interesting is that for $r$ less than two,
there are always short paths, but no local algorithm to find them. A local algorithm is an
algorithm that is only allowed to remember the source, the destination, and its current
location and can query the graph to find the long-distance edge at the current location.
Based on this information, it decides the next vertex on the path.


The difficulty is that for $r < 2$, the end points of the long-distance edges are too
uniformly distributed over the vertices of the grid. Although short paths exist, it is
unlikely on a short path to encounter a long-distance edge whose end point is close to 
the destination. When r equals two, there are short paths and the simple algorithm that 
always selects the edge that ends closest to the destination will find a short path. For r 
greater than two, again there is no local algorithm to find a short path. Indeed, with high 
probability, there are no short paths at all. 


The probability that the long-distance edge from $u$ goes to $v$ is proportional to
$d^{-r}(u,v)$. Note that the constant of proportionality will vary with the vertex $u$ depending
on where $u$ is relative to the border of the $n \times n$ grid. However, the number of
vertices at distance exactly $k$ from $u$ is at most $4k$ and for $k \le n/2$ is at least $k$. Let
$c_r(u) = \sum_v d^{-r}(u,v)$ be the normalizing constant. It is the inverse of the constant of
proportionality.
Figure 8.17: For r < 2, on a short path you are unlikely to encounter a long-distance
edge that takes you close to the destination. 


r > 2 The lengths of long distance edges tend to be short so the 
probability of encountering a sufficiently long, long-distance edge is 
too low. 


r = 2 Selecting the edge with end point closest to the destina- 
tion finds a short path. 


r < 2 The ends of long distance edges tend to be uniformly dis- 
tributed. Short paths exist but a polylog length path is unlikely 
to encounter a long distance edge whose end point is close to the 
destination. 


Figure 8.18: Effects of different values of r on the expected length of long-distance edges 
and the ability to find short paths. 
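For $r > 2$, $c_r(u)$ is lower bounded by

$$c_r(u) = \sum_v d^{-r}(u,v) \ge \sum_{k=1}^{n/2} (k)\, k^{-r} = \sum_{k=1}^{n/2} k^{1-r} \ge 1.$$

No matter how large $r$ is, the first term of $\sum_{k=1}^{n/2} k^{1-r}$ is at least one.

For $r = 2$ the normalizing constant $c_r(u)$ is upper bounded by

$$c_r(u) = \sum_v d^{-r}(u,v) \le \sum_{k=1}^{2n} (4k)\, k^{-2} \le 4 \sum_{k=1}^{2n} \frac{1}{k} = \theta(\ln n).$$

For $r < 2$, the normalizing constant $c_r(u)$ is lower bounded by

$$c_r(u) = \sum_v d^{-r}(u,v) \ge \sum_{k=1}^{n/2} (k)\, k^{-r} \ge \sum_{k=n/4}^{n/2} k^{1-r}.$$

The summation $\sum_{k=n/4}^{n/2} k^{1-r}$ has $\frac{n}{4}$ terms, the smallest of which is $\left(\frac{n}{2}\right)^{1-r}$ or $\left(\frac{n}{4}\right)^{1-r}$ depending
on whether $r$ is greater or less than one. This gives the following lower bound on $c_r(u)$.

$$c_r(u) \ge \frac{n}{4}\, \Omega(n^{1-r}) = \Omega(n^{2-r}).$$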
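
These bounds on the normalizing constant are easy to confirm by brute force on a small grid. The sketch below computes $c_r(u) = \sum_v d^{-r}(u,v)$ with grid (Manhattan) distance and compares it with the bounds used above. The function name, the grid size $n = 20$, and the two test vertices are assumptions made for the example.

```python
def normalizing_constant(u, n, r):
    """c_r(u): sum of d(u, v)^(-r) over all v != u in the n x n grid, d = grid distance."""
    ux, uy = u
    total = 0.0
    for x in range(n):
        for y in range(n):
            d = abs(x - ux) + abs(y - uy)
            if d > 0:
                total += d ** (-r)
    return total


if __name__ == "__main__":
    n = 20
    harmonic_bound = 4 * sum(1 / k for k in range(1, 2 * n + 1))
    for u in ((0, 0), (n // 2, n // 2)):  # a corner vertex and a central vertex
        print(u,
              normalizing_constant(u, n, 3),   # r > 2: at least 1
              normalizing_constant(u, n, 2),   # r = 2: at most 4 * H_{2n}
              harmonic_bound,
              normalizing_constant(u, n, 1))   # r < 2: at least sum_{k <= n/2} k^(1-r)
```

The corner vertex gives the smallest constant and the central vertex the largest, which is why the constant of proportionality depends on where $u$ sits relative to the border.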


No short paths exist for the r > 2 case. 


For $r > 2$, we first show that for at least half of the pairs of vertices, there is no short
path between them. We begin by showing that the expected number of edges of length
greater than $n^{\frac{r+2}{2r}}$ goes to zero. The probability of an edge from $u$ to $v$ is $d^{-r}(u,v)/c_r(u)$
where $c_r(u)$ is lower bounded by a constant. The probability that a particular edge of
length greater than or equal to $n^{\frac{r+2}{2r}}$ is chosen is upper bounded by $c\, n^{-\frac{r+2}{2}}$ for some
constant $c$. Since there are $n^2$ long edges, the expected number of edges of length at least
$n^{\frac{r+2}{2r}}$ is at most $c\, n^2 n^{-\frac{r+2}{2}}$ or $c\, n^{\frac{2-r}{2}}$, which for $r > 2$ goes to zero. Thus, by the first
moment method, almost surely, there are no such edges.

For at least half of the pairs of vertices, the grid distance, measured by grid edges
between the vertices, is greater than or equal to $n/4$. Any path between them must have
at least $\frac{1}{4} n / n^{\frac{r+2}{2r}} = \frac{1}{4} n^{\frac{r-2}{2r}}$ edges since there are no edges longer than $n^{\frac{r+2}{2r}}$ and so there is
no polylog length path.


An algorithm for the r = 2 case 
Figure 8.19: Small worlds. There are $\Omega(k^2)$ vertices within distance $k/2$ of $t$, each at
distance less than $3k/2$ from $u$.


For $r = 2$, the local algorithm that selects the edge that ends closest to the destination
$t$ finds a path of expected length $O(\ln n)^3$. Suppose the algorithm is at a vertex $u$ which
is at distance $k$ from $t$. Then within an expected $O(\ln n)^2$ steps, the algorithm reaches a
point at distance at most $k/2$. The reason is that there are $\Omega(k^2)$ vertices at distance at
most $k/2$ from $t$. Each of these vertices is at distance at most $k + k/2 = O(k)$ from $u$. See
Figure 8.19. Recall that the normalizing constant $c_r$ is upper bounded by $O(\ln n)$, and
hence, the constant of proportionality is lower bounded by some constant times $1/\ln n$.
The probability that the long-distance edge from $u$ goes to one of these vertices is at least

$$\Omega(k^2 k^{-r}/\ln n) = \Omega(1/\ln n).$$

Consider $c (\ln n)^2$ steps of the path from $u$, for a constant $c$. The long-distance edges from the points
visited at these steps are chosen independently and each has probability $\Omega(1/\ln n)$ of
reaching within $k/2$ of $t$. The probability that none of them does is

$$\left(1 - \Omega(1/\ln n)\right)^{c (\ln n)^2} \le c_1 e^{-\ln n} = \frac{c_1}{n}$$

for a suitable choice of constants. Thus, the distance to $t$ is halved every $O(\ln n)^2$ steps
and the algorithm reaches $t$ in an expected $O(\ln n)^3$ steps.
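
The greedy local algorithm described above can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not a definitive implementation: the function names are hypothetical, the long-distance edge of a vertex is sampled lazily when the walk first reaches that vertex (equivalent to fixing all edges in advance, because the greedy walk never revisits a vertex), and the sampler enumerates all $n^2$ candidates, which is only practical for small grids.

```python
import random


def grid_distance(a, b):
    """Shortest path length using only local grid edges (Manhattan distance)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def sample_long_distance_edge(u, n, r, rng):
    """Pick v != u with probability proportional to d(u, v)^(-r)."""
    candidates = [(x, y) for x in range(n) for y in range(n) if (x, y) != u]
    weights = [grid_distance(u, v) ** (-r) for v in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


def greedy_route(s, t, n, r, rng):
    """Repeatedly move to the neighbor (local or long-distance) closest to t."""
    path = [s]
    u = s
    while u != t:
        x, y = u
        neighbors = [(x + dx, y + dy)
                     for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                     if 0 <= x + dx < n and 0 <= y + dy < n]
        neighbors.append(sample_long_distance_edge(u, n, r, rng))
        u = min(neighbors, key=lambda v: grid_distance(v, t))
        path.append(u)
    return path


if __name__ == "__main__":
    rng = random.Random(0)
    n = 30
    path = greedy_route((0, 0), (n - 1, n - 1), n, 2, rng)
    print("steps:", len(path) - 1, "grid distance:", 2 * (n - 1))
```

Because some local neighbor is always one step closer to $t$, the distance to $t$ strictly decreases at every step, so the walk terminates after at most $d(s,t)$ steps; the long-distance edges are what can make it much shorter.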


A local algorithm cannot find short paths for the r < 2 case

For $r < 2$ no local polylog time algorithm exists for finding a short path. To illustrate
the proof, we first give the proof for the special case $r = 0$, and then give the proof for
$r < 2$.

When $r = 0$, all vertices are equally likely to be the end point of a long-distance edge.
Thus, the probability of a long-distance edge hitting one of the $n$ vertices that are within
distance $\sqrt{n}$ of the destination is $1/n$. Along a path of length $\sqrt{n}$, the probability that
the path does not encounter such an edge is $(1 - 1/n)^{\sqrt{n}}$. Now,

$$\lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^{\sqrt{n}} = \lim_{n \to \infty} \left(\left(1 - \frac{1}{n}\right)^{n}\right)^{\frac{1}{\sqrt{n}}} = \lim_{n \to \infty} e^{-\frac{1}{\sqrt{n}}} = 1.$$

Since with probability 1/2 the starting point is at distance at least $n/4$ from the
destination and in $\sqrt{n}$ steps, the path will not encounter a long-distance edge ending within
distance $\sqrt{n}$ of the destination, for at least half of the starting points the path length will
be at least $\sqrt{n}$. Thus, the expected time is at least $\frac{1}{2}\sqrt{n}$ and hence not in polylog time.
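
The limit used in the $r = 0$ argument converges quickly, which a few evaluations make visible. The sketch below, with a hypothetical function name, compares $(1 - 1/n)^{\sqrt{n}}$ with $e^{-1/\sqrt{n}}$ for growing $n$.

```python
import math


def miss_probability(n):
    """Probability that sqrt(n) independent long-distance edges all miss a target set hit with probability 1/n."""
    return (1 - 1 / n) ** math.sqrt(n)


if __name__ == "__main__":
    for n in (10 ** 2, 10 ** 4, 10 ** 6):
        print(n, miss_probability(n), math.exp(-1 / math.sqrt(n)))
```

Both columns approach one, so along a path of length $\sqrt{n}$ the algorithm almost surely sees no useful long-distance edge.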


For the general $r < 2$ case, we show that a local algorithm cannot find paths of length
$O(n^{(2-r)/4})$. Let $\delta = (2-r)/4$ and suppose the algorithm finds a path with at most $n^{\delta}$
edges. There must be a long-distance edge on the path which terminates within distance
$n^{\delta}$ of $t$; otherwise, the path would end in $n^{\delta}$ grid edges and would be too long. There are
$O(n^{2\delta})$ vertices within distance $n^{\delta}$ of $t$ and the probability that the long-distance edge from
one vertex of the path ends at one of these vertices is at most $n^{2\delta} \left(\frac{1}{n^{2-r}}\right) = n^{(r-2)/2}$. To
see this, recall that the lower bound on the normalizing constant is $\theta(n^{2-r})$ and hence an
upper bound on the probability of a long-distance edge hitting $v$ is $\theta\left(\frac{1}{n^{2-r}}\right)$ independent
of where $v$ is. Thus, the probability that the long-distance edge from one of the $n^{\delta}$ vertices
on the path hits any one of the $n^{2\delta}$ vertices within distance $n^{\delta}$ of $t$ is $n^{2\delta} \frac{1}{n^{2-r}} = n^{\frac{r-2}{2}}$.
The probability that this happens for any one of the $n^{\delta}$ vertices on the path is at most
$n^{\frac{r-2}{2}} n^{\delta} = n^{\frac{r-2}{2}} n^{\frac{2-r}{4}} = n^{(r-2)/4} = o(1)$ as claimed.








Short paths exist for r < 2 


Finally we show for $r < 2$ that there are $O(\ln n)$ length paths between $s$ and $t$. The
proof is similar to the proof of Theorem 8.13 showing $O(\ln n)$ diameter for $G(n,p)$ when
$p$ is $\Omega(\ln n/n)$, so we do not give all the details here. We give the proof only for the case
when $r = 0$.

For a particular vertex $v$, let $S_i$ denote the set of vertices at distance $i$ from $v$. Using
only local edges, if $i$ is $O(\sqrt{\ln n})$, then $|S_i|$ is $\Omega(\ln n)$. For later $i$, we argue a constant
factor growth in the size of $S_i$ as in Theorem 8.13. As long as $|S_1| + |S_2| + \cdots + |S_i| \le n^2/2$,
for each of the $n^2/2$ or more vertices outside, the probability that the vertex is not in
$S_{i+1}$ is $\left(1 - \frac{1}{n^2}\right)^{|S_i|} \le 1 - \frac{|S_i|}{2n^2}$ since the long-distance edge from each vertex of $S_i$ chooses
a long-distance neighbor at random. So, the expected size of $S_{i+1}$ is at least $|S_i|/4$ and
using Chernoff, we get constant factor growth up to $n^2/2$. Thus, for any two vertices $v$
and $w$, the number of vertices at distance $O(\ln n)$ from each is at least $n^2/2$. Any two
sets of cardinality at least $n^2/2$ must intersect giving us a $O(\ln n)$ length path from $v$ to
$w$.
8.12 Exercises


Exercise 8.1 Search the World Wide Web to find some real world graphs in machine
readable form or data bases that could automatically be converted to graphs.

1. Plot the degree distribution of each graph.

2. Count the number of connected components of each size in each graph.

3. Count the number of cycles in each graph.

4. Describe what you find.

5. What is the average vertex degree in each graph? If the graph were a G(n, p) graph,
what would the value of p be?

6. Spot differences between your graphs and G(n, p) for p from Item 5. Look at sizes
of connected components, cycles, size of giant component.


Exercise 8.2 In $G(n,p)$ the probability of a vertex having degree $k$ is $\binom{n}{k} p^k (1-p)^{n-k}$.
1. Show by direct calculation that the expected degree is np. 
2. Compute directly the variance of the degree distribution. 


3. Where is the mode of the binomial distribution for a given value of p? The mode is 
the point at which the probability is maximum. 


Exercise 8.3 
1. Plot the degree distribution for G(1000, 0.003). 


2. Plot the degree distribution for G(1000, 0.030). 


Exercise 8.4 To better understand the binomial distribution plot $\binom{n}{k} p^k (1-p)^{n-k}$ as a
function of $k$ for $n = 50$ and $p = 0.05, 0.5, 0.95$. For each value of $p$ check the sum over
all $k$ to ensure that the sum is one.


Exercise 8.5 In $G(n, \frac{1}{n})$, argue that with high probability there is no vertex of degree
greater than $\frac{6 \log n}{\log \log n}$ (i.e., the probability that such a vertex exists goes to zero as $n$ goes
to infinity). You may use the Poisson approximation and may wish to use the fact that
$k! \ge \left(\frac{k}{e}\right)^k$.


Exercise 8.6 The example of Section 8.1.1 showed that if the degrees in $G(n, \frac{1}{n})$ were
independent there would almost surely be a vertex of degree $\Omega(\log n/\log \log n)$. However,
the degrees are not independent. Show how to overcome this difficulty.
Exercise 8.7 Let $f(n)$ be a function that is asymptotically less than $n$. Some such
functions are $1/n$, a constant $d$, $\log n$ or $n^{1/3}$. Show that

$$\left(1 + \frac{f(n)}{n}\right)^n \simeq e^{f(n)(1 \pm o(1))}$$

for large $n$. That is

$$\lim_{n \to \infty} \frac{\ln \left[ \left(1 + \frac{f(n)}{n}\right)^n \right]}{f(n)} = 1.$$





Exercise 8.8

1. In the limit as $n$ goes to infinity, how does $\left(1 - \frac{1}{n}\right)^{n \ln n}$ behave?

2. What is $\lim_{n \to \infty} \left(\frac{n+1}{n}\right)^n$?


Exercise 8.9 Consider a random permutation of the integers 1 to n. The integer i is 
said to be a fixed point of the permutation if $i$ is the integer in the $i^{th}$ position of the
permutation. Use indicator variables to determine the expected number of fixed points in 
a random permutation. 

Exercise 8.10 Generate a graph $G(n, \frac{d}{n})$ with $n = 1000$ and $d = 2$, 3, and 6. Count the
number of triangles in each graph. Try the experiment with $n = 100$.

Exercise 8.11 What is the expected number of squares (4-cycles) in $G(n, \frac{d}{n})$? What is
the expected number of 4-cliques in $G(n, \frac{d}{n})$? A 4-clique consists of four vertices with all
$\binom{4}{2}$ edges present.


Exercise 8.12 Carry out an argument, similar to the one used for triangles, to show that
$p = \frac{1}{n^{2/3}}$ is a threshold for the existence of a 4-clique. A 4-clique consists of four vertices
with all $\binom{4}{2}$ edges present.


Exercise 8.13 What is the expected number of simple paths of length 3, $\log n$, $\sqrt{n}$, and
$n - 1$ in $G(n, \frac{d}{n})$? A simple path is a path where no vertex appears twice as in a cycle.
The expected number of simple paths of a given length being infinite does not imply that a
graph selected at random has such a path.


Exercise 8.14 Let $x$ be an integer chosen uniformly at random from $\{1, 2, \ldots, n\}$. Count
the number of distinct prime factors of $x$. The exercise is to show that the number of prime
factors almost surely is $\Theta(\ln \ln n)$. Let $p$ stand for a prime number between 2 and $n$.

1. For each fixed prime $p$, let $I_p$ be the indicator function of the event that $p$ divides $x$.
Show that $E(I_p) = \frac{1}{p} + O\left(\frac{1}{n}\right)$.

2. The random variable of interest, $y = \sum_p I_p$, is the number of prime divisors of $x$
picked at random. Show that the variance of $y$ is $O(\ln \ln n)$. For this, assume the
known result that the number of primes $p$ between 2 and $n$ is $O(n/\ln n)$ and that
$\sum_p \frac{1}{p} \cong \ln \ln n$. To bound the variance of $y$, think of what $E(I_p I_q)$ is for $p \ne q$, both
primes.

3. Use (1) and (2) to prove that the number of prime factors is almost surely $\Theta(\ln \ln n)$.


Exercise 8.15 Suppose one hides a clique of size $k$ in a random graph $G(n, \frac{1}{2})$. I.e., in
the random graph, choose some subset $S$ of $k$ vertices and put in the missing edges to make
$S$ a clique. Presented with the modified graph, the goal is to find $S$. The larger $S$ is, the
easier it should be to find. In fact, if $k$ is more than $c\sqrt{n \ln n}$, then with high probability
the clique leaves a telltale sign identifying $S$ as the $k$ vertices of largest degree. Prove this
statement by appealing to Theorem 8.1. It remains a puzzling open problem to find such
hidden cliques when $k$ is smaller, say, $O(n^{1/3})$.


Exercise 8.16 The clique problem in a graph is to find the maximal size clique. This
problem is known to be NP-hard and so a polynomial time algorithm is thought unlikely.
We can ask the corresponding question about random graphs. For example, in $G(n, \frac{1}{2})$
there almost surely is a clique of size $(2 - \varepsilon) \log n$ for any $\varepsilon > 0$. But it is not known how
to find one in polynomial time.

1. Show that in $G(n, \frac{1}{2})$ there almost surely are no cliques of size greater than or equal
to $2 \log_2 n$.

2. Use the second moment method to show that in $G(n, \frac{1}{2})$, almost surely there are
cliques of size $(2 - \varepsilon) \log_2 n$.

3. Show that for any $\varepsilon > 0$, a clique of size $(2 - \varepsilon) \log n$ can be found in $G(n, \frac{1}{2})$ in
time $n^{O(\ln n)}$ if one exists.

4. Give an $O(n^2)$ algorithm that finds a clique of size $\Omega(\log n)$ in $G(n, \frac{1}{2})$ with high
probability. Hint: use a greedy algorithm. Apply your algorithm to $G(1000, \frac{1}{2})$.
What size clique do you find?

5. An independent set in a graph is a set of vertices such that no two of them are
connected by an edge. Give a polynomial time algorithm for finding an independent
set in $G(n, \frac{1}{2})$ of size $\Omega(\log n)$ with high probability.


Exercise 8.17 Suppose $H$ is a fixed graph on $cn$ vertices with $\frac{1}{4} c^2 (\log n)^2$ edges. Show
that if $c \ge 2$, with high probability, $H$ does not occur as a vertex-induced subgraph of
$G(n, 1/4)$. In other words, there is no subset of $cn$ vertices of $G$ such that the graph $G$
restricted to these vertices is isomorphic to $H$. Or, equivalently, for any subset $S$ of $cn$
vertices of $G$ and any 1-1 mapping $f$ between these vertices and the vertices of $H$, there is
either an edge $(i, j)$ within $S$ such that the edge $(f(i), f(j))$ does not exist in $H$ or there
is a non-edge $i, j$ in $S$ such that $(f(i), f(j))$ does exist in $H$.
Exercise 8.18 Given two instances, $G_1$ and $G_2$ of $G(n, \frac{1}{2})$, consider the size of the largest
vertex-induced subgraph common to both $G_1$ and $G_2$. In other words, consider the largest
$k$ such that for some subset $S_1$ of $k$ vertices of $G_1$ and some subset $S_2$ of $k$ vertices of $G_2$,
the graph $G_1$ restricted to $S_1$ is isomorphic to the graph $G_2$ restricted to $S_2$. Prove that
with high probability, $k < 4 \log_2 n$.


Exercise 8.19 (Birthday problem) What is the number of integers that must be drawn 
with replacement from a set of n integers so that some integer, almost surely, will be 
selected twice? 


Exercise 8.20 Suppose you have an algorithm for finding communities in a social net- 
work. Assume that the way the algorithm works is that given the graph G for the social 
network, it finds a subset of vertices satisfying a desired property P. The specifics of prop- 
erty P are unimportant for this question. If there are multiple subsets S of vertices that 
satisfy property P, assume that the algorithm finds one such set S at random. 

In running the algorithm you find thousands of communities and wonder how many 
communities there are in the graph. Finally, when you find the 10,000th community, it is
a duplicate. It is the same community as one found earlier. Use the birthday problem to 
derive an estimate of the total number of communities in G. 


Exercise 8.21 Do a breadth first search in $G(n, \frac{d}{n})$ with $d > 1$ starting from some vertex.
The number of discovered vertices, $z_i$, after $i$ steps has distribution Binomial$(n, p_i)$ where
$p_i = 1 - \left(1 - \frac{d}{n}\right)^i$. If the connected component containing the start vertex has $i$ vertices,
then $z_i = i$. Show that as $n \to \infty$ (and $d$ is a fixed constant), Prob$(z_i = i)$ is $o(1/n)$ unless
$i \le c_1 \ln n$ or $i \ge c_2 n$ for some constants $c_1, c_2$.


Exercise 8.22 For $f(x) = 1 - e^{-dx} - x$, what is the value of $x_{max} = \arg\max f(x)$? What
is the value of $f(x_{max})$? Recall from the text that in a breadth first search of $G(n, \frac{d}{n})$, $f(x)$
is the expected normalized size of the frontier (size of frontier divided by $n$) at normalized
time $x$ ($x = t/n$). Where does the maximum expected value of the frontier of a breadth
search in $G(n, \frac{d}{n})$ occur as a function of $n$?


Exercise 8.23 Generate a random graph on 50 vertices by starting with the empty graph
and then adding random edges one at a time. How many edges do you need to add until
cycles first appear (repeat the experiment a few times and take the average)? How many
edges do you need to add until the graph becomes connected (repeat the experiment a few
times and take the average)?

Exercise 8.24 Consider $G(n, p)$ with $p = \frac{1}{3n}$.

1. Use the second moment method to show that with high probability there exists a 
simple path of length 10. In a simple path no vertex appears twice. 


2. Argue that on the other hand, it is unlikely there exists any cycle of length 10. 
Exercise 8.25 Complete the second moment argument of Theorem 8.9 to show that for
$p = \frac{d}{n}$, $d > 1$, $G(n,p)$ almost surely has a cycle. Hint: If two cycles share one or more
edges, then the number of edges in the union of the two cycles is at least one greater than
the number of vertices in the union.


Exercise 8.26 What is the expected number of isolated vertices in $G(n,p)$ for $p = \frac{1}{2} \frac{\ln n}{n}$
as a function of $n$?


Exercise 8.27 Theorem 8.13 shows that for some c > 0 and p = c ln n/n, G(n,p) has
diameter O(ln n). Tighten the argument to pin down as low a value as possible for c.


Exercise 8.28 What is the diameter of G(n,p) for various values of p? Remember that the
graph becomes fully connected at p = (ln n)/n and has diameter two at p = sqrt(2 (ln n)/n).


Exercise 8.29 
1. List five increasing properties of G(n, p).
2. List five non-increasing properties.


Exercise 8.30 If y and z are independent, non-negative, integer valued random variables,
then the generating function of the sum y + z is the product of the generating function of
y and z. Show that this follows from E(x^{y+z}) = E(x^y x^z) = E(x^y) E(x^z).


Exercise 8.31 Let f_j(x) be the j-th iterate of the generating function f(x) of a branching
process. When m > 1, lim_{j→∞} f_j(x) = q for 0 < x < 1. In the limit this implies
Prob(z_j = 0) = q and Prob(z_j = i) = 0 for all non-zero finite values of i. Shouldn't the
probabilities add up to 1? Why is this not a contradiction?


Exercise 8.32 Try to create a probability distribution for a branching process which 
varies with the current population in which future generations neither die out, nor grow 
to infinity. 


Exercise 8.33 Consider generating the edges of a random graph by flipping two coins, 
one with probability p_1 of heads and the other with probability p_2 of heads. Add the edge
to the graph if either coin comes down heads. What is the value of p for the generated 
G(n, p) graph? 


Exercise 8.34 In the proof of Theorem 8.15 that every increasing property has a threshold,
we proved for p_0(n) such that lim_{n→∞} p_0(n)/p(n) = 0 that G(n, p_0) almost surely did not
have property Q. Give the symmetric argument that for any p_1(n) such that
lim_{n→∞} p(n)/p_1(n) = 0, G(n, p_1) almost surely has property Q.


Exercise 8.35 Consider a model of a random subset N(n, p) of integers {1,2,...n} de- 
fined by independently at random including each of {1,2,...n} into the set with probability 
p. Define what an “increasing property” of N(n,p) means. Prove that every increasing 
property of N(n,p) has a threshold.


Exercise 8.36 N(n,p) is a model of a random subset of integers {1,2,...n} defined by
independently at random including each of {1,2,...n} into the set with probability p.
What is the threshold for N(n,p) to contain


1. a perfect square,

2. a perfect cube,

3. an even number,

4. three numbers such that x + y = z?


Exercise 8.37 Explain why the property that N (n, p) contains the integer 1 has a thresh- 
old. What is the threshold? 


Exercise 8.38 The Sudoku game consists of a 9 x 9 array of squares. The array is 
partitioned into nine 3 x 3 squares. Each small square should be filled with an integer 
between 1 and 9 so that each row, each column, and each 3 x 3 square contains exactly 
one copy of each integer. Initially the board has some of the small squares filled in in such 
a way that there is exactly one way to complete the assignments of integers to squares. 
Some simple rules can be developed to fill in the remaining squares such as if a row does 
not contain a given integer and if every column except one in which the square in the row 
is blank contains the integer, then place the integer in the remaining blank square in the 
row. Explore phase transitions for the Sudoku game. Some possibilities are: 


1. Start with a 9 x 9 array of squares with each square containing a number between 
1 and 9 such that no row, column, or 3 x 3 square has two copies of any integer. 
Develop a set of simple rules for filling in squares such as if a row does not contain 
a given integer and if every column except one in which the square in the row is 
blank contains the integer, then place the integer in the remaining blank entry in the 
row. How many integers can you randomly erase and your rules will still completely 
fill in the board? 


2. Generalize the Sudoku game for arrays of size n^2 × n^2. Develop a simple set of
rules for completing the game. Start with a legitimate completed array and erase k
entries at random. Experimentally determine the threshold for the integer k such
that if only k entries of the array are erased, your set of rules will find a solution.


Exercise 8.39 In a square n × n grid, each of the O(n^2) edges is randomly chosen to
be present with probability p and absent with probability 1 — p. Consider the increasing 
property that there is a path from the bottom left corner to the top right corner which 
always goes to the right or up. Show that p = 1/2 is a threshold for the property. Is it a 
sharp threshold? 


Exercise 8.40 The threshold property seems to be related to uniform distributions. What
if we considered other distributions? Consider a model where i is selected from the set
{1,2,...,n} with probability proportional to 1/i. Is there a threshold for perfect squares?


Is there a threshold for arithmetic progressions?


Exercise 8.41 Modify the proof that every increasing property of G(n, p) has a threshold
to apply to the 3-CNF satisfiability problem.


Exercise 8.42 Evaluate (1 − 1/2^k)^{2^k} for k = 3, 5, and 7. How close is it to 1/e?


Exercise 8.43 For a random 3-CNF formula with n variables and cn clauses for some
constant c, what is the expected number of satisfying assignments? 


Exercise 8.44 Which of the following variants of the SC algorithm admit a theorem like 
Theorem 8.20? 


1. Among all clauses of least length, pick the first one in the order in which they appear 
in the formula. 


2. Set the literal appearing in most clauses independent of length to 1. 


Exercise 8.45 Suppose we have a queue of jobs serviced by one server. There is a total 
of n jobs in the system. At time t, each remaining job independently decides to join the 
queue to be serviced with probability p = d/n, where d < 1 is a constant. Each job has 
a processing time of 1 and at each time the server services one job, if the queue is
non-empty. Show that with high probability, no job waits more than Ω(ln n) time to be serviced
once it joins the queue.


Exercise 8.46 Consider G(n,p). Show that there is a threshold (not necessarily sharp) 
for 2-colorability at p = 1/n. In particular, first show that for p = d/n with d < 1, with 
high probability G(n,p) is acyclic, so it is bipartite and hence 2-colorable. Next, when 
pn → ∞, the expected number of triangles goes to infinity. Show that in that case, there
is a triangle almost surely and therefore almost surely the graph is not 2-colorable. 


Exercise 8.47 A vertex cover of size k for a graph is a set of k vertices such that at least
one end of each edge is in the set. Experimentally play with the following problem. For
G(20, 1/2), for what value of k is there a vertex cover of size k?


Exercise 8.48 Construct an example of a formula which is satisfiable, but the SC heuris- 
tic fails to find a satisfying assignment. 


Exercise 8.49 In G(n, p), let x_k be the number of connected components of size k. Using
x_k, write down the probability that a randomly chosen vertex is in a connected component
of size k. Also write down the expected size of the connected component containing a
randomly chosen vertex.


Exercise 8.50 Describe several methods of generating a random graph with a given degree 
distribution. Describe differences in the graphs generated by the different methods.


Exercise 8.51 Consider generating a random graph adding one edge at a time. Let n(i,t)
be the number of components of size i at time t.


n(1,1) = 
n(1,1)=0 t>1 
Aeon ¡NA is i=5t=1)=En() 


Compute n(i,t) for a number of values of i and t. What is the behavior? What is the
sum of n(i,t) for fixed t and all i? Can you write a generating function for n(i,t)?


Exercise 8.52 In the growth model with preferential attachment where at time t a single
edge is added from the new vertex to an existing vertex with probability δ, the resulting
graph consists of a set of trees. To generate a more complex graph, at each unit of time
add d edges each with probability δ. Derive the degree distribution of the generated graph.




Exercise 8.53 The global clustering coefficient of a graph is defined as follows. Let d_v be
the degree of vertex v and let e_v be the number of edges connecting pairs of vertices that
are adjacent to vertex v. The global clustering coefficient c is given by


= ences dy(dy—1) Geni)’ 


In a social network, for example, it measures what fraction of pairs of friends of each 
person are themselves friends. If many are, the clustering coefficient is high. What is c 
for a random graph with p = d/n in the limit as n goes to infinity? For a denser graph?
Compare this value to that for some social network. 


Exercise 8.54 Consider a structured graph, such as a grid or cycle, and gradually add 
edges or reroute edges at random. Let L be the average distance between all pairs of 
vertices in a graph and let C be the ratio of triangles to connected sets of three vertices. 
Plot L and C as a function of the randomness introduced. 


Exercise 8.55 Consider an n x n grid in the plane. 


1. Prove that for any vertex u, there are at least k vertices at distance k for 1 ≤ k ≤ n/2.


2. Prove that for any vertex u, there are at most 4k vertices at distance k. 


3. Prove that for one half of the pairs of points, the distance between them is at least 
n/4. 


Exercise 8.56 Recall the definition of a small-world graph in Section 8.10. Show that
in a small-world graph with r < 2, there exist short paths with high probability. The
proof for r = 0 is in the text.


Exercise 8.57 Change the small worlds graph as follows. Start with an n × n grid where
each vertex has one long-distance edge to a vertex chosen uniformly at random. These are
exactly like the long-distance edges for r = 0. Instead of having grid edges, we have some
other graph with the property that for each vertex, there are O(t^2) vertices at distance t
from the vertex for t ≤ n. Show that, almost surely, the diameter is O(ln n).


Exercise 8.58 Consider an n-node directed graph with two random out-edges from each 
node. For two vertices s and t chosen at random, prove that with high probability there 
exists a path of length at most O(ln n) from s to t.


Exercise 8.59 Explore the concept of small world by experimentally determining the an- 
swers to the following questions: 


1. How many edges are needed to disconnect a small world graph? By disconnect we 
mean at least two pieces each of reasonable size. Is this connected to the emergence 
of a giant component? 


2. How does the diameter of a graph consisting of a cycle change as one adds a few 
random long-distance edges? 


Exercise 8.60 In the small world model with r < 2, would it help if the algorithm could 
look at edges at any node at a cost of one for each node looked at? 


Exercise 8.61 Make a list of the ten most interesting things you learned about random 
graphs.


9 Topic Models, Non-negative Matrix Factorization,
Hidden Markov Models, and Graphical Models


In the chapter on machine learning, we saw many algorithms for fitting functions to 
data. For example, suppose we want to learn a rule to distinguish spam from non-spam 
email and we were able to represent email messages as points in R^d such that the two
categories are linearly separable. Then, we could run the Perceptron algorithm to find a 
linear separator that correctly partitions our training data. Furthermore, we could argue 
that if our training sample was large enough, then with high probability, this translates to 
high accuracy on future data coming from the same probability distribution. An interest- 
ing point to note here is that these algorithms did not aim to explicitly learn a model of 
the distribution D^+ of spam emails or the distribution D^− of non-spam emails. Instead,
they aimed to learn a separator to distinguish spam from non-spam. In this chapter, we 
look at algorithms that, in contrast, aim to explicitly learn a probabilistic model of the 
process used to generate the observed data. This is a more challenging problem, and 
typically requires making additional assumptions about the generative process. For ex- 
ample, in the chapter on high-dimensional space, we assumed data came from a Gaussian 
distribution and we learned the parameters of the distribution. In the chapter on SVD, 
we considered the more challenging case that data comes from a mixture of k Gaussian 
distributions. For k = 2, this is similar to the spam detection problem, but harder in that 
we are not told which training emails are spam and which are non-spam, but easier in 
that we assume D^+ and D^− are Gaussian distributions. In this chapter, we examine other
important model-fitting problems, where we assume a specific type of process is used to 
generate data, and then aim to learn the parameters of this process from observations. 


9.1 Topic Models 


Topic Modeling is the problem of fitting a certain type of stochastic model to a given 
collection of documents. The model assumes there exist r “topics”, that each document is 
a mixture of these topics, and that the topic mixture of a given document determines the 
probabilities of different words appearing in the document. For a collection of news arti- 
cles, the topics may be politics, sports, science, etc. A topic is a set of word frequencies. 
For the topic of politics, words like “president” and “election” may have high frequencies, 
whereas for the topic of sports, words like “pitcher” and “goal” may have high frequencies. 
A document (news item) may be 60% politics and 40% sports. In that case, the word 
frequencies in the document are assumed to be convex combinations of word frequencies 
for each of these topics with weights 0.6 and 0.4 respectively. 


Each document is viewed as a “bag of words” or terms.[37] Namely, we disregard the
order and context in which each word occurs in the document and instead only list the
frequency of occurrences of each word. Frequency is the number of occurrences of the
word divided by the total count of all words in the document. Throwing away context
information may seem wasteful, but this approach works fairly well in practice. Each
document is a vector with d components where d is the total number of different terms that
exist; each component of the vector is the frequency of a particular term in the document.


[37] In practice, terms are typically words or phrases, and not all words are chosen as terms. For example,
articles and simple verbs, pronouns etc. may not be considered terms.


We can represent a collection of n documents by a d x n matrix A called the term- 
document matrix, with one column per document and one row per term. The topic model 
hypothesizes that there exist r topics (r is typically small) such that each document is 
a mixture of these topics. In particular, each document has an associated vector with r 
non-negative components summing to one, telling us the fraction of the document that is 
on each of the topics. In the example above, this vector would have 0.6 in the component 
for politics and 0.4 in the component for sports. These vectors can be arranged as the
columns of an r × n matrix C, called the topic-document matrix. Finally, there is a third
d × r matrix B for the topics. Each column of B is a vector corresponding to one topic;
it is the vector of expected frequencies of terms in that topic. The vector of expected
frequencies for a document is a convex combination of the expected frequencies for topics,
with the topic weights given by the vector in C for that document. In matrix notation,
let P be a d × n matrix with column P(:, j) denoting the expected frequencies of terms
in document j. Then,

P= BC. (9.1) 
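

As a small numerical illustration of (9.1), the sketch below multiplies a made-up term-topic matrix B by a made-up topic-document matrix C. The four-word vocabulary, the entries of B and C, and the helper name matmul are assumptions chosen only for illustration; they do not come from the text.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]


# Hypothetical term-topic matrix B: d = 4 terms, r = 2 topics.
# Rows: president, election, pitcher, goal. Columns: politics, sports.
# Each column is a vector of term frequencies and sums to one.
B = [[0.5, 0.0],
     [0.4, 0.1],
     [0.0, 0.5],
     [0.1, 0.4]]

# Hypothetical topic-document matrix C: r = 2 topics, n = 2 documents.
# Document 0 is 60% politics and 40% sports; document 1 is purely sports.
C = [[0.6, 0.0],
     [0.4, 1.0]]

P = matmul(B, C)  # expected term frequencies, one column per document
column_sums = [sum(P[i][j] for i in range(len(P))) for j in range(len(P[0]))]
print(P)
print(column_sums)
```

Each column of P is a convex combination of the columns of B, so it is again a probability vector over terms.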


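Pictorially, we can represent this as:


[Figure: the d × n matrix P (rows indexed by TERM, columns by DOCUMENT) equals the product
of the d × r matrix B (rows TERM, columns TOPIC) and the r × n matrix C (rows TOPIC,
columns DOCUMENT).]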

Topic Models are stochastic models that generate documents according to the
frequency matrix P above. p_{ij} is viewed as the probability that a random term of document
j is the i-th term in the dictionary. We make the assumption that terms in a document are
drawn independently. In general, B is assumed to be a fixed matrix, whereas C is random.
So, the process to generate n documents, each containing m terms, is the following:


Definition 9.1 (Document Generation Process) Let D be a distribution over a mixture
of topics. Let B be the term-topic matrix. Create a d × n term-document matrix A
as follows:


• Initialize a_{ij} = 0 for i = 1, 2, ..., d; j = 1, 2, ..., n.[38]

• For j = 1, 2, ..., n

    - Pick column j of C from distribution D. This will be the topic mixture for
      document j, and induces P(:, j) = BC(:, j).

    - For t = 1, 2, ..., m, do:

        * Generate the t-th term x_t of document j from the multinomial distribution
          over {1, 2, ..., d} with probability vector P(:, j), i.e., Prob(x_t = i) = p_{ij}.
        * Add 1/m to a_{x_t, j}.


[38] We will use i to index into the set of all terms, j to index documents and l to index topics.
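

The generation process of Definition 9.1 can be sketched in a few lines of Python. This is a minimal simulation under stated assumptions, not a reference implementation: the distribution D is left unspecified at this point in the text, so the helper sample_flat_dirichlet (a uniform draw from the simplex) is only a stand-in, and the matrix B and the function names are hypothetical.

```python
import random


def sample_flat_dirichlet(r, rng):
    """Stand-in for the unspecified distribution D (an assumption):
    normalized exponentials give a uniform draw from the simplex."""
    g = [rng.expovariate(1.0) for _ in range(r)]
    total = sum(g)
    return [x / total for x in g]


def generate_documents(B, n, m, rng):
    """Generate a d x n term-document matrix A as in Definition 9.1.

    B is a d x r term-topic matrix (list of rows) whose columns sum to one.
    Returns (A, C), where C is the r x n topic-document matrix that was drawn.
    """
    d, r = len(B), len(B[0])
    A = [[0.0] * n for _ in range(d)]            # initialize a_ij = 0
    C = [[0.0] * n for _ in range(r)]
    for j in range(n):
        mixture = sample_flat_dirichlet(r, rng)  # column j of C, drawn from D
        for l in range(r):
            C[l][j] = mixture[l]
        # P(:, j) = B C(:, j): expected term frequencies of document j
        p = [sum(B[i][l] * mixture[l] for l in range(r)) for i in range(d)]
        # draw the m terms of document j independently from P(:, j)
        for x in rng.choices(range(d), weights=p, k=m):
            A[x][j] += 1.0 / m
    return A, C


# Hypothetical term-topic matrix with d = 4 terms and r = 2 topics.
B = [[0.5, 0.0], [0.4, 0.1], [0.0, 0.5], [0.1, 0.4]]
A, C = generate_documents(B, n=3, m=50, rng=random.Random(0))
```

Every column of A sums to one by construction, but with only m = 50 draws its entries can differ noticeably from the corresponding column of P = BC.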


The topic modeling problem is to infer B and C from A. The probability distribution
D of the columns of C is not yet specified. The most commonly used distribution is the
Dirichlet distribution that we study in detail in Section 9.6.


Often we are given fewer terms of each document than the number of terms or the 
number of documents. Even though 


E(a_{ij} | P) = p_{ij},    (9.2)


and in expectation A equals P, the variance is high. For example, for the case when
p_{ij} = 1/d for all i with m much less than √d, A(:, j) is likely to have 1/m in a random
subset of m coordinates since no term is likely to be picked more than once. Thus


||A(:, j) − P(:, j)||_1 = m(1/m − 1/d) + (d − m)(1/d) ≈ 2,


the maximum possible. This says that in l_1 norm, which is the right norm when dealing
with probability vectors, the “noise” a_{·,j} − p_{·,j} is likely to be larger than p_{·,j}. This is one
of the reasons why the model inference problem is hard. Write


A = BC + N,    (9.3)


where A is the d × n term-document matrix, B is a d × r term-topic matrix and C is
an r × n topic-document matrix. N stands for noise, which can have high norm. The l_1
norm of each column of N could be as high as that of BC.
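

The size of the noise in the uniform example can be checked numerically. The sketch below draws m terms uniformly from d terms and computes the l_1 distance between the empirical column and the uniform column without materializing a length-d vector. The function name, the seed, and the particular values d = 1,000,000 and m = 20 are assumptions made for illustration.

```python
import random


def l1_noise_uniform(d, m, rng):
    """l_1 distance between the empirical frequencies of m uniform draws
    from d terms and the true frequency vector (1/d, ..., 1/d)."""
    counts = {}
    for _ in range(m):
        x = rng.randrange(d)
        counts[x] = counts.get(x, 0) + 1
    # Terms that were never drawn each contribute |0 - 1/d|.
    noise = (d - len(counts)) / d
    # Terms drawn c times each contribute |c/m - 1/d|.
    noise += sum(abs(c / m - 1.0 / d) for c in counts.values())
    return noise


noise = l1_noise_uniform(d=1_000_000, m=20, rng=random.Random(0))
print(noise)  # close to 2, the largest possible l_1 distance
```

Since m is far below √d here, repeated terms are unlikely and the distance is within a few multiples of m/d of the maximum value 2.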


There are two main ways of tackling the computational difficulty of finding B and C 
from A. One is to make assumptions on the matrices B and C that are both realistic and 
also admit efficient computation of B and C. The trade-off between these two desirable 
properties is not easy to strike and we will see several approaches beginning with the 
strongest assumptions on B and C in Section 9.2. The other way is to restrict N. Here 
again, an idealized way would be to assume N = 0 which leads to what is called the 
Non-negative Matrix Factorization (NMF) (Section 9.3) problem of factoring the given 
matrix A into the product of two non-negative matrices B and C. With a further
restriction on B, called Anchor terms (Section 9.4), there is a polynomial time algorithm to
do NMF. The strong restriction of N = 0 can be relaxed (Section ??), but at the cost of
computational efficiency.


The most common approach to topic modeling makes an assumption on the probabil- 
ity distribution of C, namely, that the columns of C are independent Dirichlet distributed 
random vectors. This is called the Latent Dirichlet Allocation model (Section 9.6), which 
does not admit an efficient computational procedure. We show that the Dirichlet distri- 
bution leads to many documents having a “primary topic,” whose weight is much larger 
than average in the document. This motivates a model called the “Dominant Admixture 
model” (Section 9.7) which admits an efficient algorithm. 


On top of whatever other assumptions are made, we assume that in each document, 
the m terms in it are drawn independently as in Definition 9.1. This is perhaps the biggest 
assumption of all. 


9.2 An Idealized Model 


The Topic Model inference problem is in general computationally hard. But under 
certain reasonable assumptions, it can be solved in polynomial time as we will see in this 
chapter. We start here with a highly idealized model that was historically the first for 
which a polynomial time algorithm was devised. In this model, we make two assumptions: 


The Pure Topic Assumption: Each document is purely on a single topic. I.e., each 
column j of C has a single entry equal to 1, and the rest of the entries are 0. 


Separability Assumption: The sets of terms occurring in different topics are disjoint.
I.e., for each row i of B, there is a unique column l with b_{il} ≠ 0.


Under these assumptions, the data matrix A has a block structure. Let T_l denote the
set of documents on topic l and S_l the set of terms occurring in topic l. After rearranging
columns and rows so that the rows in each S_l occur consecutively and the columns of each
T_l occur consecutively, the matrix A looks like:


                          TOPIC
                  T_1       T_2       T_3

                 * * *     0 0 0     0 0 0
          S_1    * * *     0 0 0     0 0 0
                 * * *     0 0 0     0 0 0

                 0 0 0     * * *     0 0 0
   A  =   S_2    0 0 0     * * *     0 0 0       (rows: TERM)
                 0 0 0     * * *     0 0 0

                 0 0 0     0 0 0     * * *
          S_3    0 0 0     0 0 0     * * *
                 0 0 0     0 0 0     * * *




If we can partition the documents into r clusters, T_1, T_2, ..., T_r, one for each topic,
we can take the average of each cluster and that should be a good approximation to the
corresponding column of B. It would also suffice to find the sets S_l of terms, since from
them we could read off the sets T_l of documents on each topic. We now formally state the
document generation process under the Pure Topic Assumption and the associated clustering
problem. Note that under the Pure Topic Assumption, the distribution D over columns of C is
specified by the probability that we pick each topic to be the only topic of a document.
Let α_1, α_2, ..., α_r be these probabilities.


Document Generation Process under Pure Topics Assumption: 


• Initialize all a_{ij} to zero.
• For each document do

    - Select a topic from the distribution given by {α_1, α_2, ..., α_r}.
    - Select m words according to the distribution for the selected topic.
    - For each selected word add 1/m to the corresponding term-document entry of the matrix A.


Definition 9.2 (Clustering Problem) Given A generated as above and the number of
topics r, partition the documents {1, 2, ..., n} into r clusters T_1, T_2, ..., T_r, each specified
by a topic.

Approximate Version: Partition the documents into r clusters, where at most εn of
the j ∈ {1, 2, ..., n} are misclustered.


The approximate version of Definition 9.2 suffices since we are taking the average of
the document vectors in each cluster j and returning the result as our approximation to
column j of B. Note that even if we clustered perfectly, the average will only approximate
the column of B. We now show how we can find the term clusters S_l, which then can be
used to solve the Clustering Problem.


Construct a graph G on d vertices, with one vertex per term, and put an edge between
two vertices if they co-occur in any document. By the separability assumption, we know
that there are no edges between vertices belonging to different S_l. This means that if each
S_l is a connected component in this graph, then we will be done. Note that we need to
assume m ≥ 2 (each document has at least two words) since if all documents have just
one word, there will be no edges in the graph at all and the task is hopeless.
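

The co-occurrence graph and its connected components can be computed with a union-find structure. The sketch below is one possible way to do this, assuming each document is given as a non-empty list of term indices; the function name term_clusters and the toy corpus are hypothetical.

```python
def term_clusters(documents, d):
    """Connected components of the term co-occurrence graph.

    documents: list of documents, each a non-empty list of term indices in range(d).
    Returns the components as sorted lists, ignoring terms that never occur.
    """
    parent = list(range(d))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    seen = set()
    for doc in documents:
        seen.update(doc)
        # Linking every term to the first term of the document yields the same
        # components as adding an edge between every co-occurring pair.
        for term in doc[1:]:
            a, b = find(doc[0]), find(term)
            if a != b:
                parent[a] = b

    components = {}
    for term in sorted(seen):
        components.setdefault(find(term), []).append(term)
    return sorted(components.values())


# Hypothetical corpus: terms 0-2 belong to one topic, terms 3-5 to another.
documents = [[0, 1], [1, 2], [3, 4], [4, 5, 3], [2, 0]]
clusters = term_clusters(documents, d=6)
# Cluster each document by the component that contains its first term.
doc_cluster = [next(c for c, comp in enumerate(clusters) if doc[0] in comp)
               for doc in documents]
print(clusters, doc_cluster)
```

Under the separability assumption no component mixes terms of two topics, so the only way this procedure can err is by splitting some S_l into several components, which is the case analyzed next.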


Let us now focus on a specific topic l and ask how many documents n_l we need so that
with high probability, S_l is a connected component. One annoyance here is that some
words may have very low probability and not become connected to the rest of S_l. On the
other hand, words of low probability can’t cause much harm since they are unlikely to be
the only words in a document, and so it doesn’t matter that much if we fail to cluster
them. We make this argument formal here.


Let γ < 1/3 and define ε = γ^m. Consider a partition of S_l into two subsets of terms W
and W̄ that each have probability mass at least γ in the distribution of terms in topic l.
Suppose that for every such partition, there is at least one edge between W and W̄. This
would imply that the largest connected component Ŝ_l in S_l must have probability mass
at least 1 − γ. If Ŝ_l had probability mass between γ and 1 − γ then using W = Ŝ_l would
violate the assumption about partitions with mass greater than γ having an edge between
them. If the largest component Ŝ_l had probability mass less than γ, then one could create a
union of connected components W that violates the assumption. Since Prob(Ŝ_l) ≥ 1 − γ,
the probability that a new random document of topic l contains only words not in Ŝ_l is
at most γ^m = ε. Thus, if we can prove the statement about partitions, we will be able to
correctly cluster nearly all new random documents.


To prove the statement about partitions, fix some partition of S_l into W and W̄ that
each have probability mass at least γ. The probability that m words are all in W or W̄ is
at most Prob(W)^m + Prob(W̄)^m. Thus the probability that none of n_l documents creates
an edge between W and W̄ is


(Prob(W)^m + Prob(W̄)^m)^{n_l} ≤ (γ^m + (1 − γ)^m)^{n_l}
                               ≤ ((1 − γ/2)^m)^{n_l}
                               ≤ e^{−γ m n_l / 2}


where the first inequality is due to convexity and the second is a calculation. Since there
are at most 2^d different possible partitions of S_l into W and W̄, the union bound ensures
at most a δ probability of failure by having


2^d e^{−γ m n_l / 2} ≤ δ.


This in turn is satisfied for


m n_l ≥ (2/γ)(d ln 2 + ln(1/δ)).


This proves the following result.


Lemma 9.1 If n_l m ≥ (2/γ)(d ln 2 + ln(1/δ)), then with probability at least 1 − δ, the largest
connected component in S_l has probability mass at least 1 − γ. This in turn implies that
the probability to fail to correctly cluster a new random document of topic l is at most
ε = γ^m.
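

To get a feel for the bound in Lemma 9.1, the following sketch computes the smallest number of documents n_l that satisfies it for given d, m, γ and δ, and evaluates the logarithm of the union-bound failure probability 2^d e^{−γ m n_l / 2}. The function names and the sample parameter values are assumptions chosen for illustration.

```python
import math


def documents_needed(d, m, gamma, delta):
    """Smallest integer n_l with n_l * m >= (2/gamma) * (d ln 2 + ln(1/delta))."""
    if not 0 < gamma < 1 / 3:
        raise ValueError("the argument assumes 0 < gamma < 1/3")
    if m < 2:
        raise ValueError("each document needs at least two words")
    total_words = (2.0 / gamma) * (d * math.log(2) + math.log(1.0 / delta))
    return math.ceil(total_words / m)


def log_failure_bound(d, m, gamma, n_l):
    """Natural log of the union bound 2^d * exp(-gamma * m * n_l / 2)."""
    return d * math.log(2) - gamma * m * n_l / 2


# Hypothetical setting: 1000 terms, 100 words per document.
n_l = documents_needed(d=1000, m=100, gamma=0.1, delta=0.01)
print(n_l, log_failure_bound(1000, 100, 0.1, n_l), math.log(0.01))
```

The bound depends on the product m n_l, so longer documents reduce the number of documents required in direct proportion.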


9.3 Non-negative Matrix Factorization - NMF 


We saw in Section 9.1 that while the expected value E(A|B,C) equals BC, the variance
can be high. Write

A = BC + N

where N stands for noise. In topic modeling, N can be high. But it will be useful to first
look at the problem when there is no noise. This can be thought of as the limiting case
as the number of words per document goes to infinity.


Suppose we have the exact equations A = BC where A is the given matrix with non- 
negative entries and all column sums equal to 1. Given A and the number of topics r, 
can we find B and C such that A = BC where B and C have non-negative entries? This 
is called the Non-negative Matrix Factorization (NMF) problem and has applications be- 
sides topic modeling. If B and C are allowed to have negative entries, we can use Singular 
Value Decomposition on A using the top r singular vectors of A. 


Before discussing NMF, we will take care of one technical issue. In topic modeling, 
besides requiring B and C to be non-negative, we have additional constraints stemming 
from the fact that frequencies of terms in one particular topic are non-negative reals 
summing to one, and that the fractions of each topic that a particular document is on are 
also non-negative reals summing to one. All together, the constraints are: 


1. A= BC. 
2. The entries of B and C are all non-negative. 
3. Columns of both B and C sum to one.

It will suffice to ensure the first two conditions. 


Lemma 9.2 Let A be a matrix with non-negative elements and columns summing to one.
The problem of finding a factorization BC of A satisfying the three conditions above is
reducible to the NMF problem of finding a factorization BC satisfying conditions (1) and
(2).


Proof: Suppose we have a factorization BC that satisfies (1) and (2) of a matrix A whose
columns each sum to one. We can multiply the k-th column of B by a positive real number
and divide the k-th row of C by the same real number without violating A = BC. By doing
this, we may assume that each column of B sums to one. Now we have a_{ij} = Σ_k b_{ik} c_{kj},
which implies Σ_i a_{ij} = Σ_{i,k} b_{ik} c_{kj} = Σ_k c_{kj}, the sum of the j-th column of C. Since
Σ_i a_{ij} = 1, the columns of C sum to one, giving (3). ∎
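

The rescaling step in this proof is easy to carry out explicitly. The sketch below takes a non-negative factorization of a matrix whose columns sum to one, rescales column k of B and row k of C, and leaves the product unchanged. The function names and the example matrices are hypothetical, and skipping an all-zero column of B is an assumption of this sketch rather than something the proof discusses.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]


def normalize_factorization(B, C):
    """Rescale so that every column of B sums to one while B*C is unchanged."""
    d, r, n = len(B), len(C), len(C[0])
    B2 = [row[:] for row in B]
    C2 = [row[:] for row in C]
    for k in range(r):
        s = sum(B[i][k] for i in range(d))
        if s == 0:
            continue  # an all-zero column contributes nothing; left untouched here
        for i in range(d):
            B2[i][k] = B[i][k] / s   # multiply column k of B by 1/s
        for j in range(n):
            C2[k][j] = C[k][j] * s   # divide row k of C by 1/s
    return B2, C2


# Hypothetical factorization satisfying conditions (1) and (2) only.
B = [[1.0, 0.0],
     [1.0, 2.0]]
C = [[0.2, 0.1],
     [0.3, 0.4]]
A = matmul(B, C)                    # columns of A sum to one
B2, C2 = normalize_factorization(B, C)
print(B2, C2)
```

After the rescaling, the columns of C2 sum to one automatically, exactly as the proof argues.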


Given a d × n matrix A and an integer r, the exact NMF problem is to determine
whether there exists a factorization of A into BC where B is a d × r matrix with
non-negative entries and C is an r × n matrix with non-negative entries and if so, find such a
factorization.[39]


Non-negative matrix factorization is a general problem and there are many heuristic
algorithms to solve the problem. In general, they suffer from one of two problems. They
could get stuck at local optima which are not solutions or take exponential time. In fact,
the NMF problem is NP-hard. In practice, often r is much smaller than n and d. We show
first that while the NMF problem as formulated above is a non-linear problem in r(n + d)
unknowns (the entries of B and C), it can be reformulated as a non-linear problem with
just 2r^2 unknowns under the simple non-degeneracy assumption that A has rank r. This,
in turn, allows for an algorithm that runs in polynomial time when r is a constant.


[39] B’s columns form a “basis” in which A’s columns can be expressed as non-negative linear
combinations, the “coefficients” being given by matrix C.


Lemma 9.3 If A has rank r, then the NMF problem can be formulated as a problem with
2r^2 unknowns. Using this, the exact NMF problem can be solved in polynomial time if r
is constant.


Proof: If A = BC, then each row of A is a linear combination of the rows of C. So the
space spanned by the rows of A is contained in the space spanned by the rows of the r × n
matrix C. The latter space has dimension at most r, while the former has dimension r by
assumption. So they must be equal. Thus every row of C must be a linear combination
of the rows of A. Choose any set of r independent rows of A to form an r × n matrix A_1.
Then C = S A_1 for some r × r matrix S. By analogous reasoning, if A_2 is a d × r matrix
of r independent columns of A, there is an r × r matrix T such that B = A_2 T. Now we
can easily cast NMF in terms of unknowns S and T:


A = A_2 T S A_1 ;    (S A_1)_{ij} ≥ 0 ;    (A_2 T)_{kl} ≥ 0    ∀ i, j, k, l.


It remains to solve the non-linear problem in 2r^2 variables. There is a classical
algorithm which solves such problems in time exponential only in r^2 (polynomial in the other
parameters). In fact, there is a logical theory, called the Theory of Reals, of which this
is a special case and any problem in this theory can be solved in time exponential in the
number of variables. We do not give details here. ∎


9.4 NMF with Anchor Terms 


An important case of NMF, which can be solved efficiently, is the case where there are 
anchor terms. An anchor term for a topic is a term that occurs in the topic and does not 
occur in any other topic. For example, the term “batter” may be an anchor term for the 
topic baseball and “election” for the topic politics. Consider the case that each topic has 
an anchor term. This assumption is weaker than the separability assumption of Section 
9.2, which says that all terms are anchor terms. 


In matrix notation, the assumption that each topic has an anchor term implies that 
for each column of the term-topic matrix B, there is a row whose sole non-zero entry is 
in that column. 


Definition 9.3 (Anchor Term) For each j = 1, 2, ..., r, there is an index $i_j$ such that


$$b_{i_j, j} \ne 0 \quad \text{and} \quad \forall k \ne j, \; b_{i_j, k} = 0.$$


In this case, it is easy to see that each row of the topic-document matrix C has a
scalar multiple of it occurring as a row of the given term-document matrix A.


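For example, with four topics and $c_l$ denoting row l of C, anchor terms “election” and
“batter” give rows of A that are multiples of $c_4$ and $c_2$:

$$A = BC: \qquad
\begin{array}{l} \text{election} \\ \text{batter} \end{array}
\begin{pmatrix} 0.3 \times c_4 \\ 0.2 \times c_2 \end{pmatrix}
=
\begin{pmatrix} 0 & 0 & 0 & 0.3 \\ 0 & 0.2 & 0 & 0 \end{pmatrix}
\begin{pmatrix} c_1 \\ c_2 \\ c_3 \\ c_4 \end{pmatrix}
$$

(only the two anchor-term rows of A and B are shown).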


If there is a NMF of A, there is one in which no row of C is a non-negative linear 
combination of other rows of C. If some row of C is a non-negative linear combination of 
the other rows of C, then eliminate that row of C as well as the corresponding column of 
B and suitably modify the other columns of B maintaining A = BC. For example, if


$$c_5 = 4 c_3 + 3 c_6,$$


delete row 5 of C, add 4 times column 5 of B to column 3 of B, add 3 times column 5 of 
B to column 6 of B, and delete column 5 of B. After repeating this, each row of C is pos- 
itively independent of the other rows of C, i.e., it cannot be expressed as a non-negative 
linear combination of the other rows. 
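
The elimination step just described can be checked on a small case. The following is a minimal sketch, assuming matrices stored as lists of rows; the helper names `matmul` and `eliminate_row` and the numbers in the example are hypothetical and chosen only for illustration.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def eliminate_row(B, C, k, coeffs):
    """Remove row k of C, assuming C[k] == sum(coeffs[l] * C[l]) with coeffs[l] >= 0.

    For every l in coeffs, coeffs[l] times column k of B is added to column l
    of B; then column k of B and row k of C are deleted, so B*C is unchanged.
    """
    new_B = []
    for row in B:
        updated = list(row)
        for l, w in coeffs.items():
            updated[l] += w * row[k]
        del updated[k]
        new_B.append(updated)
    new_C = [row for idx, row in enumerate(C) if idx != k]
    return new_B, new_C


# Hypothetical example: row 2 of C equals 4 * row 0 + 3 * row 1.
C = [[1, 0, 2], [0, 1, 1], [4, 3, 11]]
B = [[1, 2, 1], [0, 1, 3]]
A = matmul(B, C)
B2, C2 = eliminate_row(B, C, k=2, coeffs={0: 4, 1: 3})
```

Here `matmul(B2, C2)` reproduces `A` with one fewer inner dimension, which is the invariant the text relies on.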


If A = BC is a NMF of A and there are rows in A that are positive linear combinations
of other rows, these rows can be removed, and the corresponding rows of B removed, to give a
NMF $\hat{A} = \hat{B}C$ where $\hat{A}$ and $\hat{B}$ are the matrices A and B with the removed rows. Since
there are no rows in $\hat{A}$ that are linear combinations of other rows of $\hat{A}$, $\hat{B}$ is a diagonal
matrix and the rows of $\hat{A}$ are scalar multiples of rows of C. Now set $C = \hat{A}$ and $\hat{B} = I$
and restore the rows to $\hat{B}$ to get B such that A = BC.


To remove, in polynomial time, rows of A that are non-negative linear combinations of
other rows, check for each row $a_i$ whether there are real numbers
$x_1, x_2, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n$ such that


$$\sum_{j \ne i} x_j a_j = a_i, \qquad x_j \ge 0.$$

This is a linear program and can be solved in polynomial time. While the algorithm
runs in polynomial time, it requires solving one linear program per term. An improved
method, not presented here, solves just one linear program.


9.5 Hard and Soft Clustering 


In Section 9.2, we saw that under the assumptions that each document is purely on
one topic and each term occurs in only one topic, approximately finding B was reducible
to clustering documents according to their topic. Clustering here has the usual meaning
of partitioning the set of documents into clusters. We call this hard clustering, meaning
each data point is to be assigned to a single cluster.


Figure 9.1: Geometry of Topic Modeling. The corners of the triangle are the columns
of B. The columns of A for topic 1 are represented by circles, for topic 2 by squares, and
for topic 3 by dark circles. Columns of BC (not shown) are always inside the big triangle,
but not necessarily the columns of A.


The more general situation is that each document has a mixture of several topics. We 
may still view each topic as a cluster and each topic vector, i.e., each column of B, as 
a “cluster center” (Figure 9.1). But now, each document belongs fractionally to several 
clusters, the fractions being given by the column of C corresponding to the document. We 
may then view P(:, j) = BC(:,j) as the “cluster center” for document j. The document 
vector A(:, j) is its cluster center plus an offset or noise N(:, j).


Barring ties, each column of C has a largest entry. This entry is the primary topic 
of document j in topic modeling. Identifying the primary topic of each document is a
“hard clustering” problem, which intuitively is a useful step in solving the “soft cluster- 
ing” problem of finding the fraction of each cluster each data point belongs to. “Soft 
Clustering” just refers to finding B and C so that N = A — BC is small. In this sense, 
soft clustering is equivalent to NMF. 


We will see in Sections 9.8 and 9.9 that doing hard clustering to identify the primary 
topic and using that to solve the soft clustering problem can be carried out under some 
assumptions. The primary topic of each document is used to find the “catchwords” of
each topic, which are important words in a weaker sense than anchor words. The
catchwords are then used to find the term-topic matrix B and then C. But as stated earlier, the
general NMF problem is NP-hard. So, we make some assumptions before solving the
problem. For this, we first look at Latent Dirichlet Allocation (LDA), which guides us
towards reasonable assumptions.


9.6 The Latent Dirichlet Allocation Model for Topic Modeling 


The most widely used model for topic modeling is the Latent Dirichlet Allocation 
(LDA) model. In this model, the topic weight vectors of the documents, the columns of 
C, are picked independently from what is known as a Dirichlet distribution. The
term-topic matrix B is fixed. It is not random. The Dirichlet distribution has a parameter $\mu$
called the “concentration parameter”, which is a real number in (0, 1), typically set to
1/r. For each vector v with r non-negative components summing to one,


$$\text{Prob density}(\text{column } j \text{ of } C = v) = \frac{1}{g(\mu)} \prod_{l=1}^{r} v_l^{\mu - 1},$$


where $g(\mu)$ is the normalizing constant so that the total probability mass is one. Since
$\mu < 1$, if any $v_l = 0$, then the probability density is infinite.


Once C is generated, the Latent Dirichlet Allocation model hypothesizes that the
matrix


$$P = BC$$

acts as the probability matrix for the data matrix A, namely,

$$E(A \mid P) = P.$$


Assume the model picks m terms for each document. Each trial is according to the
multinomial distribution with probability vector P(:, j); so the probability that the first
term we pick to include in document j is the i-th term in the dictionary is $p_{ij}$. Then,
$a_{ij}$ is set equal to the fraction out of m of the number of times term i occurs in document j.
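
The generative process above can be sketched in a few lines. This is an illustrative sketch, not a reference implementation: it assumes a symmetric Dirichlet sampled by normalizing Gamma variates (a standard construction), and the function names, the toy matrix `B`, and the values of `mu` and `m` are all hypothetical.

```python
import random


def sample_dirichlet(r, mu, rng):
    """Symmetric Dirichlet sample obtained by normalizing Gamma(mu, 1) variates."""
    g = [rng.gammavariate(mu, 1.0) for _ in range(r)]
    total = sum(g)
    if total == 0.0:  # guard against floating-point underflow for tiny mu
        g[rng.randrange(r)] = 1.0
        total = 1.0
    return [x / total for x in g]


def generate_document(B, mu, m, rng):
    """Return (c, a): topic weights and term-frequency vector of one document."""
    d, r = len(B), len(B[0])
    c = sample_dirichlet(r, mu, rng)                       # column of C
    p = [sum(B[i][l] * c[l] for l in range(r)) for i in range(d)]  # column of P = BC
    counts = [0] * d
    for i in rng.choices(range(d), weights=p, k=m):        # m multinomial trials
        counts[i] += 1
    return c, [n / m for n in counts]                      # column of A


rng = random.Random(0)
# Hypothetical term-topic matrix: 4 terms, 2 topics, each column sums to one.
B = [[0.5, 0.0], [0.3, 0.1], [0.2, 0.2], [0.0, 0.7]]
c, a = generate_document(B, mu=0.5, m=200, rng=rng)
```

Each call yields one column of C and the corresponding column of A; repeating it n times builds the whole data matrix.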


The Dirichlet density favors low $v_l$, but since the $v_l$ have to sum to one, there is at
least one component that is high. We show that if $\mu$ is small, then with high probability,
the highest entry of the column is typically much larger than the average. So, in each
document, one topic, which may be thought of as the “primary topic” of the document, gets
disproportionately high weight. To prove this, we have to work out some properties of
the Dirichlet distribution. The first lemma describes the marginal probability density of
each coordinate of a Dirichlet distributed random variable:


Lemma 9.4 Suppose the joint distribution of $y = (y_1, y_2, \ldots, y_r)$ is the Dirichlet
distribution with concentration parameter $\mu$. Then, the marginal probability density q(y) of $y_1$
is given by

$$q(y) = \frac{\Gamma(r\mu + 1)}{\Gamma(\mu)\,\Gamma((r-1)\mu + 1)} \; y^{\mu - 1} (1 - y)^{(r-1)\mu}, \qquad y \in (0, 1),$$

where $\Gamma$ is the Gamma function (see Appendix for the definition).
Proof: By definition of the marginal,


$$q(y) = \frac{1}{g(\mu)}\, y^{\mu-1} \int_{y_2 + y_3 + \cdots + y_r = 1 - y} (y_2 y_3 \cdots y_r)^{\mu - 1} \, dy_2 \, dy_3 \cdots dy_r.$$

Put $z_l = y_l/(1 - y)$. With this change of variables,

$$q(y) = \frac{1}{g(\mu)}\, y^{\mu-1} (1 - y)^{(r-1)\mu} \left( \int_{z_2 + z_3 + \cdots + z_r = 1} (z_2 z_3 \cdots z_r)^{\mu - 1} \, dz_2 \, dz_3 \cdots dz_r \right).$$


The quantity inside the parentheses is independent of y, so for some c we have

$$q(y) = c\, y^{\mu-1} (1 - y)^{(r-1)\mu}.$$

Since $\int_0^1 q(y)\, dy = 1$, we must have


$$c = \frac{1}{\int_0^1 y^{\mu-1}(1-y)^{(r-1)\mu}\, dy} = \frac{\Gamma(r\mu + 1)}{\Gamma(\mu)\,\Gamma((r-1)\mu + 1)}.$$

∎

Lemma 9.5 Suppose the joint distribution of $y = (y_1, y_2, \ldots, y_r)$ is the Dirichlet
distribution with parameter $\mu \in (0, 1)$. For $\zeta \in (0, 1)$,


$$\text{Prob}(y_1 \ge 1 - \zeta) \ge \frac{0.85\, \mu\, \zeta^{(r-1)\mu + 1}}{(r-1)\mu + 1}.$$


Hence for $\mu = 1/r$, we have $\text{Prob}(y_1 \ge 1 - \zeta) \ge 0.4\zeta^2/r$. If also $\zeta < 0.5$, then
$\text{Prob}\left(\max_{l=1}^{r} y_l \ge 1 - \zeta\right) \ge 0.4\zeta^2$.
Proof: Since $\mu < 1$, we have $y^{\mu-1} > 1$ for $y < 1$ and so $q(y) \ge c(1 - y)^{(r-1)\mu}$, so


$$\int_{1-\zeta}^{1} q(y)\, dy \ge \frac{c\, \zeta^{(r-1)\mu + 1}}{(r-1)\mu + 1}.$$


To lower bound c, note that $\Gamma(\mu) \le 1/\mu$ for $\mu \in (0, 1)$. Also, $\Gamma(x)$ is an increasing function
for $x \ge 1.5$, so if $(r-1)\mu + 1 \ge 1.5$, then $\Gamma(r\mu + 1) \ge \Gamma((r-1)\mu + 1)$ and in this case,
the first assertion of the lemma follows. If $(r-1)\mu + 1 \in [1, 1.5]$, then $\Gamma((r-1)\mu + 1) \le 1$
and $\Gamma(r\mu + 1) \ge \min_{z \in [1,2]} \Gamma(z) \ge 0.85$, so again, the first assertion follows.


If now $\mu = 1/r$, then $(r-1)\mu + 1 < 2$ and so $\zeta^{(r-1)\mu+1}/((r-1)\mu + 1) \ge \zeta^2/2$. So the
second assertion of the lemma follows easily. For the third assertion, note that the events
$y_l > 1 - \zeta$, $l = 1, 2, \ldots, r$, are mutually exclusive for $\zeta < 0.5$ (since at most one $y_l$ can be
greater than 1/2), so

$$\text{Prob}\left(\max_{l} y_l \ge 1 - \zeta\right) = \sum_{l=1}^{r} \text{Prob}(y_l > 1 - \zeta) = r\, \text{Prob}(y_1 \ge 1 - \zeta) \ge 0.4\zeta^2.$$

∎
For example, from the last lemma, it follows that


1. With high probability, a constant fraction of the documents have a primary topic of
weight at least 0.6. In expectation, the fraction of documents for which this holds
is at least $0.4(0.4)^2$ (take $\zeta = 0.4$).


2. Also with high probability, a smaller constant fraction of the documents are nearly
pure (weight at least 0.95 on a single topic). Take $\zeta = 0.05$.
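
The bound $0.4\zeta^2$ of Lemma 9.5 can be compared with a simulation. The sketch below is only a sanity check under stated assumptions: it samples a symmetric Dirichlet with $\mu = 1/r$ by normalizing Gamma variates, and the function names, the choice r = 10, the number of trials, and the seed are hypothetical.

```python
import random


def max_weight(r, mu, rng):
    """Largest coordinate of one symmetric Dirichlet(mu) sample in r dimensions."""
    g = [rng.gammavariate(mu, 1.0) for _ in range(r)]
    total = sum(g)
    return max(g) / total if total > 0 else 1.0


def fraction_with_primary_weight(r, zeta, trials, seed=0):
    """Empirical fraction of samples whose largest coordinate is >= 1 - zeta."""
    rng = random.Random(seed)
    mu = 1.0 / r
    hits = sum(max_weight(r, mu, rng) >= 1.0 - zeta for _ in range(trials))
    return hits / trials


frac_06 = fraction_with_primary_weight(r=10, zeta=0.4, trials=20000)
frac_095 = fraction_with_primary_weight(r=10, zeta=0.05, trials=20000)
```

In such a run one would expect both empirical fractions to sit above the lemma's lower bounds $0.4(0.4)^2$ and $0.4(0.05)^2$, since the lemma is not tight.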


If the total number of documents, n, is large, there will be many nearly pure
documents. Since for a document j that is nearly pure on topic l we have $c_{lj} \ge 0.95$,
$BC_{\cdot,j} = B(:, l) + \Delta$, where $\|\Delta\|_1 \le 0.05$. If we could find the nearly pure documents for a given topic l, then
the average of the columns of A corresponding to these documents will be close to the
average of those columns in the matrix BC (though this is not true for individual columns)
and it is intuitively clear that we would be done.


We pursue (1) and (2) in the next section, where we see that under these assumptions, 
plus one more assumption, we can indeed find B. 
More generally, the concentration parameter may be different for different topics. We
then have $\mu_1, \mu_2, \ldots, \mu_r$ so that


$$\text{Prob density}(\text{column } j \text{ of } C = v) \propto \prod_{l=1}^{r} v_l^{\mu_l - 1}.$$

The model fitting problem for Latent Dirichlet Allocation, namely, given A, find the
term-topic matrix B, is in general NP-hard. There are heuristics, however, which are widely
used. Latent Dirichlet Allocation is known to work well in several application areas.


9.7 The Dominant Admixture Model 


In this section, we formulate a model with three key assumptions. The first two are mo- 
tivated by Latent Dirichlet Allocation, respectively by (1) and (2) of the last section. The 
third assumption is also natural; it is more realistic than the anchor words assumptions 
discussed earlier. This section is self-contained and no familiarity with Latent Dirichlet 
Allocation is needed. 


We first recall the notation. A is a d × n data matrix with one document per column,
which is the frequency vector of the d terms in that document. m is the number of words
in each document. r is the “inner dimension”, i.e., B is d × r and C is r × n. We always
index topics by l and l′, terms by i, and documents by j.


We give an intuitive description of the model assumptions first and then make formal
statements.


1. Primary Topic Each document has a primary topic. The weight of the primary
topic in the document is high and the weight of each non-primary topic is low.


2. Pure Document Each topic has at least one pure document that is mostly on that
topic.


3. Catchword Each topic has at least one catchword, which has high frequency in
that topic and low frequency in other topics.


In the next section, we state quantitative versions of the assumptions and show that 
these assumptions suffice to yield a simple polynomial time algorithm to find the primary 
topic of each document. The primary topic classification can then be used to find B
approximately, but this requires a further assumption, Assumption (4) in Section 9.9 below,
which is a robust version of the Pure Document assumption.


Let's provide some intuition for how we are able to do the primary topic classification.
By using the primary topic and catchword assumptions, we can show (quantitative version
in Claim 9.1 below) that if i is a catchword for topic l, then there is a threshold $\mu_i$, which
we can compute for each catchword, so that for each document j with primary topic l,
$p_{ij}$ is above $\mu_i$ and for each document j whose primary topic is not l, $p_{ij}$ is substantially
below $\mu_i$. So, if


1. we were given P, and


2. knew a catchword for each topic and the threshold,

we could find the primary topic of each document.


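We illustrate the situation in Equation 9.4, where rows 1, 2, ..., r of P correspond
to catchwords for topics 1, 2, ..., r and we have rearranged columns in order of primary
topic. H stands for a high entry and L for a low entry.


$$P = \begin{pmatrix}
H & H & H & L & L & L & L & L & L & L & L & L \\
L & L & L & H & H & H & L & L & L & L & L & L \\
L & L & L & L & L & L & H & H & H & L & L & L \\
L & L & L & L & L & L & L & L & L & H & H & H \\
\vdots & & & & & & & & & & & \vdots
\end{pmatrix} \qquad (9.4)$$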


We are given A, but not P. While E(A|P) = P, A could be far off from P. In fact,
if in column j of P there are many entries smaller than c/m, then in A (since we are doing
only m multinomial trials) they could all be zeros and so are not a good approximation
to the entries of P(:, j). However, if $p_{ij} > c/m$ for a large value c, then $a_{ij} \approx p_{ij}$.
Think of tossing a coin m times whose probability of heads is $p_{ij}$. If $p_{ij} \ge c/m$, then the
number of heads one gets is close to $p_{ij}m$. We will assume c larger than $\Omega(\log(nd))$ for
catchwords so that for every such i and j we have $p_{ij} \approx a_{ij}$. See the formal catchwords
assumption in the next section. This addresses (1), namely, $A \approx P$, at least in these rows.

One can ask if this is a reasonable assumption. If m is in the hundreds, the assumption
is arguably reasonable. But a weaker and more reasonable assumption would be that
there is a set of catchwords, not just one, with total frequency higher than c/m. However,
here we use the stronger assumption of a single high frequency catchword.


(2) is more difficult to address. Let $l(i) = \arg\max_{l'=1}^{r} b_{il'}$. Let $T_l$ be the set of j with
primary topic l. Whether or not i is a catchword, the primary topic assumption will imply
that $p_{ij}$ does not drop by more than a certain factor $\alpha$ among $j \in T_{l(i)}$. We prove this
formally in Claim 9.1 of Section 9.8. That claim also proves that if i is a catchword for
topic l, then there is a sharp drop in $p_{ij}$ between $j \in T_l$ and $j \notin T_l$.


But for non-catchwords, there is no guarantee of a sharp fall in $p_{ij}$ between $j \in T_{l(i)}$
and $j \notin T_{l(i)}$. However, we can identify for each i where the first fall of roughly a factor $\alpha$
from the maximum occurs in row i of A. For catchwords, we show below (Section 9.8) that this
happens precisely between $T_{l(i)}$ and $[n] \setminus T_{l(i)}$. For non-catchwords, we show that the fall
does not occur among $j \in T_{l(i)}$. So, the minimal sets where the fall occurs are the $T_l$ and
we use this to identify them. We call this process Pruning.


9.8 Formal Assumptions 
Parameters $\alpha$, $\beta$, $\rho$, and $\delta$ are real numbers in (0, 0.4] satisfying


$$\beta + \rho \le (1 - 3\delta)\alpha. \qquad (9.5)$$


(1) Primary Topic There is a partition of [n] into $T_1, T_2, \ldots, T_r$ with:


$$c_{lj} \begin{cases} \ge \alpha & \text{for } j \in T_l \\ \le \beta & \text{for } j \notin T_l. \end{cases} \qquad (9.6)$$

(2) Pure Document For each l, there is some j with

$$c_{lj} \ge 1 - \delta.$$

(3) Catchwords For each l, there is at least one catchword i satisfying:

$$b_{il'} \le \rho\, b_{il} \quad \text{for } l' \ne l \qquad (9.7)$$

$$b_{il} \ge \mu, \quad \text{where } \mu = \frac{c \log(10nd/\delta)}{m\alpha^2\delta^2}, \; c \text{ constant.} \qquad (9.8)$$

Let

$$l(i) = \arg\max_{l'=1}^{r} b_{il'}. \qquad (9.9)$$


Another way of stating the assumption $b_{il} \ge \mu$ is that the expected number of times term
i occurs in topic l among m independent trials is at least $c\log(10nd/\delta)/\alpha^2\delta^2$, which grows
only logarithmically in n and d. As stated at the end of the last section, the point of
requiring $b_{il} \ge \mu$ for catchwords is so that using the Hoeffding-Chernoff inequality, we can
assert that $a_{ij} \approx p_{ij}$. We state the Hoeffding-Chernoff inequality in the form we use it:
Lemma 9.6


$$\text{Prob}\left(|a_{ij} - p_{ij}| \ge \delta\alpha \max(p_{ij}, \mu)/4\right) \le \frac{\delta}{10nd}.$$


So, with probability at least $1 - (\delta/10)$,

$$|a_{ij} - p_{ij}| \le \delta\alpha \max(\mu, p_{ij})/4 \quad \forall i, j$$


simultaneously. After paying the failure probability of $\delta/10$, we henceforth assume that
the above holds.


Proof: Since $a_{ij}$ is the average of m independent Bernoulli trials, each with expectation
$p_{ij}$, the Hoeffding-Chernoff inequality asserts that


$$\text{Prob}\left(|a_{ij} - p_{ij}| \ge \Delta\right) \le 2\exp\left(-cm \min\left(\frac{\Delta^2}{p_{ij}}, \Delta\right)\right).$$


Plugging in $\Delta = \alpha\delta \max(p_{ij}, \mu)/4$, the first statement of the lemma follows with some
calculation. The second statement is proved by a union bound over the nd possible (i, j)
values. ∎


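Algorithm

1. Compute Thresholds: $\mu_i = \alpha(1 - \delta) \max_j a_{ij}$.

2. Do thresholding: Define a matrix $\hat{A}$ by


$$\hat{a}_{ij} = \begin{cases} 1 & \text{if } a_{ij} \ge \mu_i \text{ and } \mu_i \ge \mu\alpha\left(1 - \frac{5\delta}{2}\right) \\ 0 & \text{otherwise.} \end{cases}$$


3. Pruning: Let $R_i = \{j \mid \hat{a}_{ij} = 1\}$. If any $R_i$ strictly contains another, set all
entries of row i of $\hat{A}$ to zero.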
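
The three steps can be sketched as follows. This is a minimal illustration under two assumptions that the text leaves implicit: "contains another" is read as "strictly contains another non-empty set", and the pruning test is applied to the sets as they stand before any row is zeroed. The function name `primary_topic_sets` and the toy data are hypothetical.

```python
def primary_topic_sets(A, alpha, delta, mu):
    """Return R[i], the set of documents kept for term i after pruning.

    A is a list of rows (terms) of document frequencies.
    """
    d = len(A)
    R = []
    for i in range(d):
        mu_i = alpha * (1 - delta) * max(A[i])               # step 1: threshold
        if mu_i >= mu * alpha * (1 - 2.5 * delta):            # step 2: thresholding
            R.append(frozenset(j for j, a in enumerate(A[i]) if a >= mu_i))
        else:
            R.append(frozenset())
    # Step 3 (assumed reading): empty R[i] if it strictly contains another
    # non-empty R[k]; all comparisons use the sets from before pruning.
    pruned = []
    for i in range(d):
        contains_smaller = any(R[k] and R[k] < R[i] for k in range(d))
        pruned.append(frozenset() if contains_smaller else R[i])
    return pruned


# Hypothetical data: terms 0 and 1 behave like catchwords, term 2 does not.
A = [[0.30, 0.28, 0.02, 0.01],
     [0.01, 0.02, 0.25, 0.27],
     [0.20, 0.22, 0.15, 0.18]]
sets = primary_topic_sets(A, alpha=0.4, delta=0.1, mu=0.1)
```

In this toy input the third row passes the threshold for every document, so its set strictly contains the other two and is pruned, leaving the two document clusters.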


Theorem 9.7 For i = 1, 2, ..., d, let $R_i = \{j \mid \hat{a}_{ij} = 1\}$ at the end of the algorithm. Then,
each non-empty $R_i = T_{l(i)}$, with l(i) as in (9.9).


Proof: We start with a lemma which proves the theorem for catchwords. This is the 
bulk of the work in the proof of the theorem. 


Lemma 9.8 If i is a catchword for topic l, then $R_i = T_l$.


Proof: Assume throughout this proof that i is a catchword for topic l. The proof consists
of three claims. The first argues that for $j \in T_l$, $p_{ij}$ is high and for $j \notin T_l$, $p_{ij}$ is low. The
second claim argues the same for $a_{ij}$ instead of $p_{ij}$. It follows from the Hoeffding-Chernoff
inequality since $a_{ij}$ is just the average of m Bernoulli trials, each with probability $p_{ij}$.
The third claim shows that the threshold computed in the first step of the algorithm falls
between the high and the low.
Claim 9.1 For i, a catchword for topic l,


$$b_{il} \ge p_{ij} \ge b_{il}\alpha \quad \text{for } j \in T_l,$$
$$p_{ij} \le b_{il}\alpha(1 - 3\delta) \quad \text{for } j \notin T_l.$$


Proof: For $j \in T_l$, using (9.6),


$$p_{ij} = \sum_{l'=1}^{r} b_{il'} c_{l'j} \in [b_{il}\alpha, \; b_{il}]$$


since $b_{il} = \max_{l'} b_{il'}$. For $j \notin T_l$,


$$p_{ij} = b_{il}c_{lj} + \sum_{l' \ne l} b_{il'} c_{l'j} \le b_{il}c_{lj} + \rho b_{il}(1 - c_{lj}) \le b_{il}(\beta + \rho) \le b_{il}\alpha(1 - 3\delta), \qquad (9.10)$$

where the first inequality is from (9.7) and the second inequality is because subject to the
constraint $c_{lj} \le \beta$ imposed by the Primary Topic Assumption (9.6), $b_{il}c_{lj} + \rho b_{il}(1 - c_{lj})$ is
maximized when $c_{lj} = \beta$. We have also used (9.5). ∎


Claim 9.2 With probability at least $1 - \delta/10$, for every l and every catchword i of l:


$$a_{ij} \begin{cases} \ge b_{il}\alpha(1 - \delta/4) & \text{for } j \in T_l \\ \le b_{il}\alpha(1 - (11/4)\delta) & \text{for } j \notin T_l. \end{cases}$$


Proof: Suppose for some $j \in T_l$, $a_{ij} < b_{il}\alpha(1 - \delta/4)$. Then, since $p_{ij} \ge b_{il}\alpha$ by Claim
9.1, $|a_{ij} - p_{ij}| \ge \delta\alpha b_{il}/4$. Since i is a catchword, $b_{il} \ge \mu$ and so
$|a_{ij} - p_{ij}| \ge \delta\alpha b_{il}/4 \ge \delta\alpha \max(p_{ij}, \mu)/4$ and we get the first inequality of the current
claim using Lemma 9.6.

For the second inequality: for $j \notin T_l$, $p_{ij} \le b_{il}\alpha(1 - 3\delta)$ by Claim 9.1 and so if this
inequality is violated, $|a_{ij} - p_{ij}| \ge b_{il}\alpha\delta/4$ and we get a contradiction to Lemma 9.6.
∎
Claim 9.3 With probability at least $1 - \delta$, for every topic l and every catchword i of topic l,
the $\mu_i$ computed in step 1 of the algorithm satisfies: $\mu_i \in \left((1 - (5/2)\delta)\, b_{il}\alpha, \; b_{il}\alpha(1 - \delta/2)\right)$.


Proof: If i is a catchword for topic l and $j_0$ a pure document for l, then


$$p_{ij_0} = \sum_{l'=1}^{r} b_{il'} c_{l'j_0} \ge b_{il}c_{lj_0} \ge (1 - \delta)b_{il}.$$

Applying Lemma 9.6, $a_{ij_0} > (1 - (3/2)\delta)b_{il}$. Thus, $\mu_i$ computed in step 1 of the algorithm
satisfies $\mu_i > (1 - (3\delta/2))(1 - \delta)b_{il}\alpha \ge (1 - (5/2)\delta)\alpha b_{il}$. Hence, $\hat{a}_{ij}$ is not set to zero for
all j. Now, since $p_{ij} \le b_{il}$ for all j, $a_{ij} \le (1 + \alpha\delta/4)b_{il}$ by Lemma 9.6, implying


$$\mu_i = \max_j a_{ij}(1 - \delta)\alpha \le b_{il}(1 + (\alpha\delta/4))(1 - \delta)\alpha \le b_{il}\alpha(1 - \delta/2).$$

∎

Claims 9.2 and 9.3 complete the proof of Lemma 9.8. ∎


The lemma proves Theorem 9.7 for catchwords. Note that since each topic has at least
one catchword, for each l, there is some i with $R_i = T_l$.


Suppose i is a non-catchword. Let $a = \max_j a_{ij}$. If $a < \mu(1 - (5\delta/2))$, then
$\mu_i < \mu\alpha(1 - (5\delta/2))$ and the entire row of $\hat{A}$ will be set to all zeros by the algorithm, so $R_i = \emptyset$
and there is nothing to prove. Assume that $a \ge \mu(1 - (5\delta/2))$. Let $j_0 = \arg\max_j a_{ij}$.
Then $a = a_{ij_0} \ge \mu(1 - (5\delta/2))$. We claim $p_{ij_0} \ge a(1 - \delta/2)$. If not, $p_{ij_0} < a(1 - \delta/2)$ and


$$|a_{ij_0} - p_{ij_0}| > \max\left(\frac{p_{ij_0}\delta}{4}, \frac{\mu\alpha\delta}{4}\right),$$

which contradicts Lemma 9.6. So,

$$b_{il} \ge p_{ij_0} \ge \mu(1 - 3\delta). \qquad (9.11)$$
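Let l = l(i). Then

$$a(1 - \delta/2) \le p_{ij_0} = \sum_{l'=1}^{r} b_{il'} c_{l'j_0} \le b_{il}.$$


Also, if $j_1$ is a pure document for topic l, $c_{lj_1} \ge (1 - \delta)$ so $p_{ij_1} \ge b_{il}c_{lj_1} \ge b_{il}(1 - \delta)$.
Now, we claim that

$$a_{ij_1} \ge b_{il}(1 - (3\delta/2)). \qquad (9.12)$$

If not,

$$p_{ij_1} - a_{ij_1} > b_{il}(1 - \delta) - b_{il}(1 - (3\delta/2)) = b_{il}\frac{\delta}{2} \ge \max\left(\frac{\mu\delta}{4}, \frac{p_{ij_1}\delta}{4}\right),$$

contradicting Lemma 9.6. So (9.12) holds and thus,

$$a \ge b_{il}(1 - (3\delta/2)). \qquad (9.13)$$

Now, for all $j \in T_l$, $p_{ij} \ge b_{il}c_{lj} \ge a(1 - \delta/2)\alpha$. So, by applying Lemma 9.6 again, for all
$j \in T_l$,

$$a_{ij} \ge a(1 - \delta)\alpha.$$


By step 1 of the algorithm, $\mu_i = a(1 - \delta)\alpha$, so $a_{ij} \ge \mu_i$ for all $j \in T_l$. So, either $R_i = T_l$
or $T_l \subsetneq R_i$. In the latter case, the pruning step will set $\hat{a}_{ij} = 0$ for all j, since topic l has
some catchword $i_0$ for which $R_{i_0} = T_l$ by Lemma 9.8. ∎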


9.9 Finding the Term-Topic Matrix 


For this, we need an extra assumption, which we first motivate. Suppose, as in Section
9.8, we assume that there is a single pure document for each topic. In terms of Figure
9.1 with three topics, this says that there is a column of P close to each vertex of the
triangle. But the corresponding column of A can be very far from this. So, even if we were
told which document is pure for each topic, we cannot find the column of B. However, if
we had a large number of nearly pure documents for each topic, since the corresponding
columns of A are independent even conditioned on P, the average of these columns gives
us a good estimate of the column of B. We also note that there is a justification for
assuming the existence of a number of documents which are nearly pure for each topic
based on the Latent Dirichlet Allocation model (see (2) of Section 9.6). The assumption
is:


Assumption (4): Set of Pure Documents For each l, there is a set $W_l$ of at least
$\delta n$ documents with


$$c_{lj} \ge 1 - \frac{\delta}{4} \quad \forall j \in W_l.$$


If we could find the set of pure documents for each topic with possibly a small fraction
of errors, we could average them. The major task of this section is to state and prove an
algorithm that does this. For this, we use the primary topic classification, $T_1, T_2, \ldots, T_r$
from the last section. We know that for a catchword i of topic l, the maximum value of
$p_{ij}$, j = 1, 2, ..., n occurs for a pure document and indeed if the assumption above holds,
the set of $\delta n/4$ documents with the top $\delta n/4$ values of $p_{ij}$ should be all pure documents.
But to make use of this, we need to know the catchword, which we are not given. To
discover them, we use another property of catchwords. If i is a catchword for topic l,
then on $T_{l'}$, $l' \ne l$, the values of $p_{ij}$ are (substantially) lower. So we know that if i is a
catchword of topic l, then it has the property:


Property: The $(\delta n/4)$-th maximum value among $p_{ij}$, $j \in T_l$ is substantially higher
than the $(\delta n/4)$-th maximum value among $p_{ij}$, $j \in T_{l'}$ for any $l' \ne l$.


We can computationally recognize the property for A (not P) and on the lines of 
Lemma 9.6, we can show that it holds essentially for A if and only if it holds for P. 


But then, we need to prove a converse of the statement above, namely we need to
show that if the property holds for i and l, then i is a catchword for topic l. Since
catchwords are not necessarily unique, this is not quite true. But we will prove that any i
satisfying the property for topic l does have $b_{il'} \le \alpha b_{il}$ $\forall l' \ne l$ (Lemma 9.11) and so acts
essentially like a catchword. Using this, we will show that the $\delta n/4$ documents among all
documents with the highest values of $a_{ij}$ for an i satisfying the property will be nearly
pure documents on topic l in Lemma 9.12 and use this to argue that their average gives
a good approximation to column l of B (Theorem 9.13).


The extra steps in the Algorithm: (By the theorem, the $T_l$, l = 1, 2, ..., r are now
known.)


1. For l = 1, 2, ..., r, and for i = 1, 2, ..., d, let g(i, l) be the $(1 - (\delta/4))$-th fractile
(see footnote 40) of $\{A_{ij} : j \in T_l\}$.

2. For each l, choose an i(l) (we will prove there is at least one) such that

$$g(i(l), l) \ge (1 - (\delta/2))\mu; \qquad g(i(l), l') \le (1 - 2\delta)\,\alpha\, g(i(l), l) \quad \forall l' \ne l. \qquad (9.14)$$

3. Let $R_l$ be the set of $\delta n/4$ j's among j = 1, 2, ..., n with the highest $A_{i(l),j}$.


4. Return $\tilde{b}_{\cdot,l} = \frac{1}{|R_l|} \sum_{j \in R_l} a_{\cdot,j}$ as our approximation to $b_{\cdot,l}$.


Lemma 9.9 i(l) satisfying (9.14) exists for each l.


Proof: Let i be a catchword for l. Then, since $\forall j \in W_l$, $p_{ij} \ge b_{il}c_{lj} \ge b_{il}(1 - (\delta/4))$ and
$b_{il} \ge \mu$, we have $a_{ij} \ge b_{il}(1 - (\delta/2))$ and so $g(i, l) \ge (1 - (\delta/2))b_{il} \ge (1 - (\delta/2))\mu$, by
Lemma 9.6. For $j \notin T_l$,

$$a_{ij} \le b_{il}\alpha(1 - (5\delta/2))$$


by Claim 9.2 and so $g(i, l') \le b_{il}\alpha(1 - (5\delta/2))$. So g(i, l) satisfies both the requirements
of step 2 of the algorithm. ∎


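Fix attention on one l. Let i = i(l). Let

$$\rho_i = \max_{k=1}^{r} b_{ik}.$$

Lemma 9.10 $\rho_i \ge \left(1 - \frac{3}{4}\delta\right)\mu$.


Proof: We have $p_{ij} = \sum_{k=1}^{r} b_{ik}c_{kj} \le \rho_i$ for all j. So, $a_{ij} \le \rho_i + \frac{\alpha\delta}{4}\max(\mu, \rho_i)$, from Lemma
9.6. So, either $\rho_i \ge \mu$, whence the lemma clearly holds, or $\rho_i < \mu$ and


$$\forall j, \; a_{ij} \le \rho_i + \frac{\alpha\delta}{4}\mu \implies g(i, l) \le \rho_i + (\alpha\delta/4)\mu.$$


By definition of i(l), $g(i, l) \ge (1 - (\delta/2))\mu$, so


$$\rho_i + \frac{\alpha\delta}{4}\mu \ge (1 - (\delta/2))\mu,$$


from which the lemma follows. ∎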


Lemma 9.11 $b_{ik} \le \alpha b_{il}$ for all $k \ne l$.


Footnote 40: The $\gamma$-th fractile of a set S of real numbers is the largest real number a so that at
least $\gamma|S|$ elements of S are each greater than or equal to a.
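
The fractile in the footnote can be computed by sorting. The sketch below follows the footnote's wording literally; the function name `fractile` is hypothetical, and it assumes $\gamma > 0$ and a non-empty list.

```python
import math


def fractile(values, gamma):
    """Largest a such that at least gamma * len(values) elements are >= a.

    With k = ceil(gamma * |S|), this is the k-th largest element. Floating-point
    rounding in gamma * len(values) can shift k by one at exact boundaries.
    """
    ordered = sorted(values, reverse=True)
    k = max(1, math.ceil(gamma * len(ordered)))
    return ordered[k - 1]


half = fractile([5, 1, 4, 2, 3], 0.5)   # at least 2.5, i.e. 3, elements must be >= a
full = fractile([5, 1, 4, 2, 3], 1.0)   # every element must be >= a
```

For the five values shown, `half` is the third-largest element and `full` is the minimum.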
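Proof: Suppose not. Let

$$l' = \arg\max_{k: k \ne l} b_{ik}.$$


We have $b_{il'} > \alpha b_{il}$ and

$$\rho_i = \max(b_{il}, b_{il'}) < \frac{b_{il'}}{\alpha}. \qquad (9.15)$$

Since for $j \in W_{l'}$, at most $\delta/4$ weight is put on topics other than l′,

$$\forall j \in W_{l'}, \quad p_{ij} \le b_{il'}\left(1 - \frac{\delta}{4}\right) + \frac{\delta}{4}\rho_i < b_{il'}\left(1 - \frac{\delta}{4} + \frac{\delta}{4\alpha}\right). \qquad (9.16)$$

Also, for $j \in W_{l'}$,

$$p_{ij} \ge b_{il'}c_{l'j} \ge b_{il'}\left(1 - \frac{\delta}{4}\right). \qquad (9.17)$$

By Lemma 9.6,

$$\forall j \in W_{l'}, \quad a_{ij} \ge p_{ij} - \frac{\alpha\delta}{4}\max(\mu, p_{ij}) \ge b_{il'}\left(1 - \frac{\delta}{4}\right) - \frac{\alpha\delta}{4} \cdot \frac{\rho_i}{1 - (3\delta/4)} \quad \text{by Lemma 9.10}$$

$$\ge b_{il'}\left(1 - \frac{3}{4}\delta\right), \qquad (9.18)$$


using (9.15) and $\delta \le 0.4$. From (9.18), it follows that


$$g(i, l') \ge b_{il'}\left(1 - \frac{3}{4}\delta\right). \qquad (9.19)$$


Since $p_{ij} \le \rho_i$ for all j, using Lemma 9.10,


$$\forall j, \quad a_{ij} \le \rho_i + \frac{\alpha\delta}{4}\max(\mu, p_{ij}) \le \rho_i\left(1 + \frac{\alpha\delta}{4} \cdot \frac{1}{1 - (3\delta/4)}\right) < b_{il'}\,\frac{1 + (5\delta/6)}{\alpha},$$


by (9.15); this implies


$$g(i, l) \le b_{il'}\,\frac{1 + (5\delta/6)}{\alpha}. \qquad (9.20)$$

Now, the definition of i(l) implies

$$g(i, l') \le g(i, l)\,\alpha(1 - 2\delta) \le b_{il'}(1 + (5\delta/6))\,\alpha(1 - 2\delta)/\alpha < b_{il'}(1 - \delta),$$

contradicting (9.19) and proving Lemma 9.11. ∎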
Lemma 9.12 For each $j \in R_l$ of step 3 of the algorithm, we have


$$c_{lj} \ge 1 - 2\delta.$$

Proof: Let $J = \{j : c_{lj} < 1 - 2\delta\}$. Take a $j \in J$. We argue that $j \notin R_l$.

$$p_{ij} \le b_{il}(1 - 2\delta) + 2\delta\alpha b_{il} \le b_{il}(1 - 1.2\delta),$$


by Lemma 9.11 using $\alpha \le 0.4$. So for $j \in J$, we have

$$a_{ij} \le b_{il}(1 - 1.2\delta) + \frac{\alpha\delta}{4}\max(\mu, b_{il}) < b_{il}(1 - \delta).$$


But

$$\forall j \in W_l, \quad p_{ij} \ge b_{il}\left(1 - \frac{\delta}{4}\right) \implies a_{ij} \ge b_{il}(1 - \delta) \implies g(i, l) \ge b_{il}(1 - \delta).$$


So for no $j \in J$ is $a_{ij} \ge g(i, l)$ and hence no $j \in J$ belongs to $R_l$. ∎


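Theorem 9.13 Assume

$$n \ge \frac{cd}{m\delta^3}; \qquad m \ge \frac{c}{\delta^2}.$$


For all l, $1 \le l \le r$, the $\tilde{b}_{\cdot,l}$ returned by step 4 of the Algorithm satisfies

$$\|\tilde{b}_{\cdot,l} - b_{\cdot,l}\|_1 \le 6\delta.$$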


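Proof: Recall that BC = P. Let V = A − P. From Lemma 9.12, we know that for each
$j \in R_l$, $c_{lj} \ge 1 - 2\delta$. So

$$P_{\cdot,j} = (1 - \gamma)B_{\cdot,l} + v,$$


where $\gamma \le 2\delta$ and v is a combination of other columns of B with $\|v\|_1 \le \gamma \le 2\delta$. Thus,
by the triangle inequality, $\|P_{\cdot,j} - B_{\cdot,l}\|_1 \le \gamma + \|v\|_1 \le 4\delta$, and we have that

$$\left\| \frac{1}{|R_l|} \sum_{j \in R_l} p_{\cdot,j} - b_{\cdot,l} \right\|_1 \le 4\delta. \qquad (9.21)$$

So it suffices now to show that

$$\left\| \frac{1}{|R_l|} \sum_{j \in R_l} (p_{\cdot,j} - a_{\cdot,j}) \right\|_1 \le 2\delta. \qquad (9.22)$$


Note that for an individual $j \in R_l$, $\|a_{\cdot,j} - p_{\cdot,j}\|_1$ can be almost two. For example, if each
$p_{ij} = 1/d$, then $A_{ij}$ would be 1/m for a random subset of m i's and zero for the rest.
What we exploit is that when we average over $\Omega(n)$ j's in $R_l$, the error is small. For this,
the independence of $a_{\cdot,j}$, $j \in R_l$ would be useful. But they are not necessarily independent,
there being conditioning on the fact that they all belong to $R_l$. But there is a simple way
around this conditioning. Namely, we prove (9.22) with very high probability for each
$R \subseteq [n]$, $|R| = \delta n/4$ and then just take the union bound over all $\binom{n}{\delta n/4}$ such subsets.


We know that $E(a_{\cdot,j}) = p_{\cdot,j}$. Now consider the random variable x defined by


$$x = \frac{1}{|R|} \left\| \sum_{j \in R} v_{\cdot,j} \right\|_1.$$

x is a function of m|R| independent random variables, namely, the choice of m|R| terms
in the |R| documents. Changing any one changes x by at most 1/(m|R|). So the Bounded
Difference Inequality from probability implies that


$$\text{Prob}\left(|x - Ex| > \delta\right) \le 2\exp\left(-\delta^3 mn/8\right). \qquad (9.23)$$

We also have to bound E(x).


$$E(x) = \frac{1}{|R|} E\left( \left\| \sum_{j \in R} v_{\cdot,j} \right\|_1 \right) = \frac{1}{|R|} \sum_{i=1}^{d} E\left| \sum_{j \in R} v_{ij} \right|$$

$$\le \frac{1}{|R|} \sum_{i=1}^{d} \sqrt{E\left( \sum_{j \in R} v_{ij} \right)^2} \qquad \text{Jensen's inequality: } E(y) \le \sqrt{E(y^2)}$$

$$= \frac{1}{|R|} \sum_{i=1}^{d} \sqrt{\sum_{j \in R} E(v_{ij}^2)} \qquad \text{since } \{v_{ij}, j \in R\} \text{ are independent and variances add up}$$

$$\le \frac{\sqrt{d}}{|R|} \left( \sum_{i=1}^{d} \sum_{j \in R} E(v_{ij}^2) \right)^{1/2} \qquad \text{Cauchy-Schwarz}$$

$$\le \frac{\sqrt{d}}{\sqrt{m|R|}} \le \delta,$$


since $E(v_{ij}^2) \le p_{ij}/m$ and $\sum_i p_{ij} = 1$ and by hypothesis, $n \ge cd/m\delta^3$. Using this along
with (9.23), we see that for a single $R \subseteq \{1, 2, \ldots, n\}$ with $|R| = \delta n/4$,


$$\text{Prob}\left( \left\| \frac{1}{|R|} \sum_{j \in R} v_{\cdot,j} \right\|_1 \ge 2\delta \right) \le 2\exp\left(-c\delta^3 mn\right),$$


which implies using the union bound that


$$\text{Prob}\left( \exists R, |R| = \frac{\delta n}{4} : \left\| \frac{1}{|R|} \sum_{j \in R} v_{\cdot,j} \right\|_1 \ge 2\delta \right) \le 2\exp\left(-c\delta^3 mn + c\delta n\right) \le \delta,$$


because the number of R is $\binom{n}{\delta n/4} \le (cn/\delta n)^{\delta n/4} \le \exp(c\delta n)$ and $m \ge c/\delta^2$ by hypothesis.
This completes the proof of the theorem. ∎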


9.10 Hidden Markov Models 


A hidden Markov model (HMM) consists of a finite set of states with a transition
between each pair of states. There is an initial probability distribution $\alpha$ on the states
and a transition probability $a_{ij}$ associated with the transition from state i to state j. Each
state also has a probability distribution p(O, i) giving the probability of outputting the
symbol O in state i. A transition consists of two components: a state transition to a
new state followed by the output of a symbol. The HMM starts by selecting a start state
according to the distribution $\alpha$ and outputting a symbol.


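Example: An example of an HMM with two states q and p and two output symbols h
and t is illustrated below.


[Figure: state diagram of the two-state HMM. From state q the model moves to q with
probability 1/2 and to p with probability 1/2; from state p it moves to q with probability
3/4 and to p with probability 1/4. State q outputs h with probability 1/2 and t with
probability 1/2; state p outputs h with probability 2/3 and t with probability 1/3.]


The initial distribution is $\alpha(q) = 1$ and $\alpha(p) = 0$. At each step a change of state occurs
followed by the output of heads or tails with probability determined by the new state.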


We consider three problems in increasing order of difficulty. First, given an HMM 
what is the probability of a given output sequence? Second, given an HMM and an out- 
put sequence, what is the most likely sequence of states? And third, knowing that the 
HMM has at most n states and given an output sequence, what is the most likely HMM? 
Only the third problem concerns a “hidden” Markov model. In the other two problems, 
the model is known and the questions can be answered in polynomial time using dynamic 
programming. There is no known polynomial time algorithm for the third question. 


How probable is an output sequence 


Given an HMM, how probable is the output sequence $O = O_0O_1O_2 \cdots O_T$ of length
T + 1? To determine this, calculate for each state i and each initial segment of the sequence
of observations, $O_0O_1O_2 \cdots O_t$ of length t + 1, the probability of observing $O_0O_1O_2 \cdots O_t$
ending in state i. This is done by a dynamic programming algorithm starting with t = 0
and increasing t. For t = 0 there have been no transitions. Thus, the probability of
observing $O_0$ ending in state i is the initial probability of starting in state i times the
probability of observing $O_0$ in state i. The probability of observing $O_0O_1O_2 \cdots O_t$ ending
in state i is the sum over all states j of the probability of observing $O_0O_1O_2 \cdots O_{t-1}$
ending in state j times the probability of going from state j to state i and observing $O_t$.
The time to compute the probability of a sequence of length T when there are n states is
$O(n^2T)$. The factor $n^2$ comes from the calculation for each time unit of the contribution
from each possible previous state to the probability of each possible current state. The
space complexity is O(n) since one only needs to remember the probability of reaching
each state for the most recent value of t.

Algorithm to calculate the probability of the output sequence


The probability, $\text{Prob}(O_0O_1 \cdots O_T, i)$, of the output sequence $O_0O_1 \cdots O_T$ ending in
state i is given by


$$\text{Prob}(O_0, i) = \alpha(i)\, p(O_0, i)$$

and for t = 1 to T

$$\text{Prob}(O_0O_1 \cdots O_t, i) = \sum_j \text{Prob}(O_0O_1 \cdots O_{t-1}, j)\; a_{ji}\; p(O_t, i).$$

Example: What is the probability of the sequence hhht by the HMM in the above two
state example?


         q                                               p

t = 3    (3/32)(1/2)(1/2) + (5/72)(3/4)(1/2) = 19/384    (3/32)(1/2)(1/3) + (5/72)(1/4)(1/3) = 37/(64×27)

t = 2    (1/8)(1/2)(1/2) + (1/6)(3/4)(1/2) = 3/32        (1/8)(1/2)(2/3) + (1/6)(1/4)(2/3) = 5/72

t = 1    (1/2)(1/2)(1/2) = 1/8                           (1/2)(1/2)(2/3) = 1/6

t = 0    1/2                                             0


For $t = 0$, the $q$ entry is $\frac{1}{2}$ since the probability of being in state $q$ is one
and the probability of outputting heads is $\frac{1}{2}$. The entry for $p$ is zero since the
probability of starting in state $p$ is zero. For $t = 1$, the $q$ entry is $\frac{1}{8}$ since
for $t = 0$ the $q$ entry is $\frac{1}{2}$ and in state $q$ the HMM goes to state $q$ with
probability $\frac{1}{2}$ and outputs heads with probability $\frac{1}{2}$. The $p$ entry is
$\frac{1}{6}$ since for $t = 0$ the $q$ entry is $\frac{1}{2}$ and in state $q$ the HMM goes to
state $p$ with probability $\frac{1}{2}$ and outputs heads with probability $\frac{2}{3}$. For
$t = 2$, the $q$ entry is $\frac{3}{32}$, which consists of two terms. The first term is the
probability of ending in state $q$ at $t = 1$ times the probability of staying in $q$ and
outputting h. The second is the probability of ending in state $p$ at $t = 1$ times the
probability of going from state $p$ to state $q$ and outputting h.


From the table, the probability of producing the sequence hhht is
$\frac{19}{384} + \frac{37}{1728} = 0.0709$. ■
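The table can be reproduced with a few lines of code. The following is a minimal sketch of the dynamic program, not a definitive implementation. The model parameters are read off the entries of the table above (start in state q with probability one, transitions q to q and q to p with probability 1/2 each, p to q with probability 3/4, p to p with probability 1/4, heads with probability 1/2 in q and 2/3 in p). The names `initial`, `trans`, `emit`, and `sequence_probability` are illustrative choices, and exact fractions are used so the result can be compared with the table directly.

```python
from fractions import Fraction as F

states = ["q", "p"]
initial = {"q": F(1), "p": F(0)}
# trans[j][i]: probability of going from state j to state i
trans = {"q": {"q": F(1, 2), "p": F(1, 2)},
         "p": {"q": F(3, 4), "p": F(1, 4)}}
# emit[i][symbol]: probability of observing symbol in state i
emit = {"q": {"h": F(1, 2), "t": F(1, 2)},
        "p": {"h": F(2, 3), "t": F(1, 3)}}


def sequence_probability(outputs):
    """Probability of the output sequence: O(n^2 T) time, O(n) space."""
    # t = 0: no transitions yet
    prob = {i: initial[i] * emit[i][outputs[0]] for i in states}
    # t >= 1: only the row for the previous value of t is kept
    for symbol in outputs[1:]:
        prob = {i: sum(prob[j] * trans[j][i] for j in states) * emit[i][symbol]
                for i in states}
    return sum(prob.values())


hhht = sequence_probability("hhht")
print(hhht, float(hhht))
```

Summing the final row over both states gives the probability of the whole sequence, which is the last step of the worked example.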


The most likely sequence of states - the Viterbi algorithm 


Given an HMM and an observation $O = O_0 O_1 \cdots O_T$, what is the most likely sequence
of states? The solution is given by the Viterbi algorithm, which is a slight modification
to the dynamic programming algorithm just given for determining the probability of an
output sequence. For $t = 0, 1, 2, \ldots, T$ and for each state $i$, we calculate the probability
of the most likely sequence of states to produce the output $O_0 O_1 O_2 \cdots O_t$ ending in state
$i$ as follows. For each state $j$, we have already computed the probability of the most
likely sequence producing $O_0 O_1 O_2 \cdots O_{t-1}$ ending in state $j$, and we multiply this by the
probability of the transition from $j$ to $i$ producing $O_t$. We then select the $j$ for which this
product is largest. Note that in the previous example, we added the probabilities of each
possibility together. Now we take the maximum and also record where the maximum
came from. The time complexity is $O(n^2 T)$ and the space complexity is $O(nT)$. The
space complexity bound is argued as follows. In calculating the probability of the most
likely sequence of states that produces $O_0 O_1 \cdots O_t$ ending in state $i$, we remember the
previous state $j$ by putting an arrow with edge label $t$ from $i$ to $j$. At the end, one can find
the most likely sequence by tracing backwards as is standard for dynamic programming
algorithms.


Example: For the earlier example what is the most likely sequence of states to produce 
the output hhht? 






































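$$
\begin{array}{c|cc|cc}
 & q & \text{from} & p & \text{from} \\
\hline
t=3 & \max\{\frac{1}{16}\cdot\frac{1}{2}\cdot\frac{1}{2},\ \frac{1}{24}\cdot\frac{3}{4}\cdot\frac{1}{2}\} = \frac{1}{64} & q \text{ or } p & \max\{\frac{1}{16}\cdot\frac{1}{2}\cdot\frac{1}{3},\ \frac{1}{24}\cdot\frac{1}{4}\cdot\frac{1}{3}\} = \frac{1}{96} & q \\
t=2 & \max\{\frac{1}{8}\cdot\frac{1}{2}\cdot\frac{1}{2},\ \frac{1}{6}\cdot\frac{3}{4}\cdot\frac{1}{2}\} = \frac{1}{16} & p & \max\{\frac{1}{8}\cdot\frac{1}{2}\cdot\frac{2}{3},\ \frac{1}{6}\cdot\frac{1}{4}\cdot\frac{2}{3}\} = \frac{1}{24} & q \\
t=1 & \frac{1}{2}\cdot\frac{1}{2}\cdot\frac{1}{2} = \frac{1}{8} & q & \frac{1}{2}\cdot\frac{1}{2}\cdot\frac{2}{3} = \frac{1}{6} & q \\
t=0 & \frac{1}{2} & q & 0 & p
\end{array}
$$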








Note that the two sequences of states, qqpq and qpqq, are tied for the most likely
sequence of states.
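The same table can be computed mechanically. The sketch below is one possible rendering of the Viterbi recurrence, under the assumption that the two-state model has the parameters read off the worked tables (the dictionaries `initial`, `trans`, and `emit` are illustrative names). Ties are broken in favor of the state listed first, so the code returns only one of the two tied sequences.

```python
from fractions import Fraction as F

states = ["q", "p"]
initial = {"q": F(1), "p": F(0)}
trans = {"q": {"q": F(1, 2), "p": F(1, 2)},
         "p": {"q": F(3, 4), "p": F(1, 4)}}
emit = {"q": {"h": F(1, 2), "t": F(1, 2)},
        "p": {"h": F(2, 3), "t": F(1, 3)}}


def viterbi(outputs):
    """Return (probability, state sequence) of a most likely state sequence."""
    best = {i: initial[i] * emit[i][outputs[0]] for i in states}
    back = []  # back[t-1][i] is the chosen predecessor of state i at time t
    for symbol in outputs[1:]:
        new_best, pointers = {}, {}
        for i in states:
            # maximum instead of sum, and remember where it came from
            j_star = max(states, key=lambda j: best[j] * trans[j][i])
            new_best[i] = best[j_star] * trans[j_star][i] * emit[i][symbol]
            pointers[i] = j_star
        best = new_best
        back.append(pointers)  # O(nT) space for the arrows
    last = max(states, key=lambda i: best[i])
    path = [last]
    for pointers in reversed(back):  # trace the arrows backwards
        path.append(pointers[path[-1]])
    return best[last], "".join(reversed(path))


probability, path = viterbi("hhht")
print(probability, path)
```

Storing one dictionary of back pointers per time step is what raises the space requirement from $O(n)$ to $O(nT)$.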


Determining the underlying hidden Markov model 


Given an $n$-state HMM, how do we adjust the transition probabilities and output
probabilities to maximize the probability of an output sequence $O_1 O_2 \cdots O_T$? The assumption
is that $T$ is much larger than $n$.* There is no known computationally efficient method
for solving this problem. However, there are iterative techniques that converge to a local
optimum.


Let $a_{ij}$ be the transition probability from state $i$ to state $j$ and let $b_j(O_k)$ be the
probability of output $O_k$ given that the HMM is in state $j$. Given estimates for the HMM
parameters, $a_{ij}$ and $b_j$, and the output sequence $O$, we can improve the estimates by
calculating for each time step the probability that the HMM goes from state $i$ to state $j$
and outputs the symbol $O_k$, conditioned on $O$ being the output sequence.





* If $T < n$ then one can just have the HMM be a linear sequence that outputs $O_1 O_2 \cdots O_T$ with
probability 1.


$a_{ij}$            transition probability from state $i$ to state $j$
$b_j(O_{t+1})$      probability of $O_{t+1}$ given that the HMM is in state $j$ at time $t+1$
$\alpha_t(i)$       probability of seeing $O_0 O_1 \cdots O_t$ and ending in state $i$ at time $t$
$\beta_{t+1}(j)$    probability of seeing the tail of the sequence $O_{t+2} O_{t+3} \cdots O_T$ given state $j$ at time $t+1$
$\delta_t(i,j)$     probability of going from state $i$ to state $j$ at time $t$ given the sequence of outputs $O$
$s_t(i)$            probability of being in state $i$ at time $t$ given the sequence of outputs $O$
$p(O)$              probability of output sequence $O$











Given estimates for the HMM parameters, $a_{ij}$ and $b_j$, and the output sequence $O$, the
probability $\delta_t(i,j)$ of going from state $i$ to state $j$ at time $t$ is given by the probability of
producing the output sequence $O$ and going from state $i$ to state $j$ at time $t$ divided by
the probability of producing the output sequence $O$.


$$\delta_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{p(O)}$$


The probability $p(O)$ is the sum over all pairs of states $i$ and $j$ of the numerator in the
above formula for $\delta_t(i,j)$. That is,


$$p(O) = \sum_i \sum_j \alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j).$$


The probability of being in state $i$ at time $t$ is given by
$$s_t(i) = \sum_{j=1}^{n} \delta_t(i,j).$$


Summing $s_t(i)$ over all time periods gives the expected number of times state $i$ is visited
and the sum of $\delta_t(i,j)$ over all time periods gives the expected number of times edge $i$ to
$j$ is traversed.


Given estimates of the HMM parameters $a_{ij}$ and $b_j(O_k)$, we can calculate by the above
formulas estimates for


1. $\sum_{t=1}^{T-1} s_t(i)$, the expected number of times state $i$ is visited and departed from
2. $\sum_{t=1}^{T-1} \delta_t(i,j)$, the expected number of transitions from state $i$ to state $j$

Using these estimates we can obtain new estimates of the HMM parameters


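$$\bar{a}_{ij} = \frac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions out of state } i} = \frac{\sum_{t=1}^{T-1} \delta_t(i,j)}{\sum_{t=1}^{T-1} s_t(i)}$$


$$\bar{b}_j(O_k) = \frac{\text{expected number of times in state } j \text{ observing symbol } O_k}{\text{expected number of times in state } j} = \frac{\sum_{\substack{t=1 \\ \text{subject to } O_t = O_k}}^{T-1} s_t(j)}{\sum_{t=1}^{T-1} s_t(j)}$$


By iterating the above formulas we can arrive at a local optimum for the HMM parameters
$a_{ij}$ and $b_j(O_k)$.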
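One round of this re-estimation can be sketched as follows. This is an illustration under stated assumptions rather than the text's exact procedure: states are numbered from 0, the initial distribution `init` is held fixed, $s_t(i)$ is computed as $\alpha_t(i)\beta_t(i)/p(O)$ (which equals $\sum_j \delta_t(i,j)$ whenever a transition follows time $t$), and the output update also counts the final time step, as in the standard Baum-Welch formulation, whereas the sums above stop at $T-1$. The starting parameters are those of the earlier coin example, and all function names are hypothetical.

```python
from fractions import Fraction as F


def forward(a, b, init, obs):
    """alpha[t][i]: probability of seeing obs[0..t] and ending in state i."""
    n = len(a)
    alpha = [[init[i] * b[i][obs[0]] for i in range(n)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([sum(prev[i] * a[i][j] for i in range(n)) * b[j][obs[t]]
                      for j in range(n)])
    return alpha


def backward(a, b, obs):
    """beta[t][i]: probability of seeing obs[t+1..] given state i at time t."""
    n = len(a)
    beta = [[F(1)] * n for _ in obs]
    for t in range(len(obs) - 2, -1, -1):
        beta[t] = [sum(a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j]
                       for j in range(n)) for i in range(n)]
    return beta


def reestimate(a, b, init, obs):
    """One re-estimation step; returns (new_a, new_b, p(O) under the old model)."""
    n, last = len(a), len(obs) - 1
    alpha, beta = forward(a, b, init, obs), backward(a, b, obs)
    p_obs = sum(alpha[-1])
    # delta[t][i][j]: probability of going from i to j at time t, given obs
    delta = [[[alpha[t][i] * a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j] / p_obs
               for j in range(n)] for i in range(n)] for t in range(last)]
    # s[t][i]: probability of being in state i at time t, given obs
    s = [[alpha[t][i] * beta[t][i] / p_obs for i in range(n)]
         for t in range(len(obs))]
    departures = [sum(s[t][i] for t in range(last)) for i in range(n)]
    visits = [sum(s[t][i] for t in range(len(obs))) for i in range(n)]
    new_a = [[sum(delta[t][i][j] for t in range(last)) / departures[i]
              for j in range(n)] for i in range(n)]
    new_b = [{k: sum(s[t][j] for t in range(len(obs)) if obs[t] == k) / visits[j]
              for k in b[j]} for j in range(n)]
    return new_a, new_b, p_obs


a = [[F(1, 2), F(1, 2)], [F(3, 4), F(1, 4)]]
b = [{"h": F(1, 2), "t": F(1, 2)}, {"h": F(2, 3), "t": F(1, 3)}]
init = [F(1), F(0)]
new_a, new_b, p_old = reestimate(a, b, init, "hhht")
p_new = sum(forward(new_a, new_b, init, "hhht")[-1])
print(float(p_old), float(p_new))
```

Each row of `new_a` and each dictionary in `new_b` sums to one, and the probability of the observed sequence does not decrease from `p_old` to `p_new`. Repeating the step moves toward a local optimum, not necessarily a global one.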


9.11 Graphical Models and Belief Propagation 


A graphical model is a compact representation of a probability distribution over $n$
variables $x_1, x_2, \ldots, x_n$. It consists of a graph, directed or undirected, whose vertices
correspond to variables that take on values from some set. In this chapter, we consider the
case where the set of values the variables take on is finite, although graphical models are
often used to represent probability distributions with continuous variables. The edges of
the graph represent relationships or constraints between the variables.


In the directed model, it is assumed that the directed graph is acyclic. This model
represents a joint probability distribution that factors into a product of conditional
probabilities.


$$p(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid \text{parents of } x_i)$$

The directed graphical model is called a Bayesian or belief network and appears frequently
in the artificial intelligence and the statistics literature.


The undirected graphical model, called a Markov random field, can also represent a 
joint probability distribution of the random variables at its vertices. In many applications 
the Markov random field represents a function of the variables at the vertices which is to 
be optimized by choosing values for the variables. 


A third model called the factor model is akin to the Markov random field, but here
the dependency sets have a different structure. In the following sections we describe all
these models in more detail.


Figure 9.2: A Bayesian network


9.12 Bayesian or Belief Networks 


A Bayesian network is a directed acyclic graph where vertices correspond to variables
and a directed edge from $y$ to $x$ represents a conditional probability $p(x \mid y)$. If a vertex $x$
has edges into it from $y_1, y_2, \ldots, y_k$, then the conditional probability is $p(x \mid y_1, y_2, \ldots, y_k)$.
The variable at a vertex with no in edges has an unconditional probability distribution.
If the value of a variable at some vertex is known, then the variable is called evidence.
An important property of a Bayesian network is that the joint probability is given by the
product over all nodes of the conditional probability of the node conditioned on all its
immediate predecessors.


In the example of Figure 9.2, a patient is ill and sees a doctor. The doctor ascertains
the symptoms of the patient and the possible causes such as whether the patient was in
contact with farm animals, whether he had eaten certain foods, or whether the patient
has a hereditary predisposition to any diseases. Using the above Bayesian network where
the variables are true or false, the doctor may wish to determine one of two things: what
is the marginal probability of a given disease, or what is the most likely set of diseases? In
determining the most likely set of diseases, we are given a T or F assignment to the causes
and symptoms and ask what assignment of T or F to the diseases maximizes the joint
probability. This latter problem is called the maximum a posteriori probability (MAP) problem.


Given the conditional probabilities and the probabilities $p(C_1)$ and $p(C_2)$ in Figure
9.2, the joint probability $p(C_1, C_2, D_1, \ldots)$ can be computed easily for any combination
of values of $C_1, C_2, D_1, \ldots$. However, we might wish to find the value of the variables of
highest probability (MAP) or we might want one of the marginal probabilities $p(D_1)$ or
$p(D_2)$. The obvious algorithms for these two problems require evaluating the probability
$p(C_1, C_2, D_1, \ldots)$ over exponentially many input values or summing the probability
$p(C_1, C_2, D_1, \ldots)$ over exponentially many values of the variables other than those for
which we want the marginal probability. In certain situations, when the joint probability
distribution can be expressed as a product of factors, a belief propagation algorithm can
solve the maximum a posteriori problem or compute all marginal probabilities quickly.
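To make the "obvious algorithms" concrete, here is a small sketch on a made-up network. Everything in it is hypothetical: two causes `c1` and `c2`, a single disease `d`, a single symptom `s`, and invented probability tables that are not taken from the figure. The joint probability is the product of the conditional probabilities at the nodes, the marginal is obtained by summing the joint over all values of the other variables, and the MAP value of the disease is found by trying each assignment.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical network: c1 -> d <- c2, d -> s (all numbers invented)
p_c1 = F(1, 4)
p_c2 = F(1, 2)
p_d_given = {(False, False): F(1, 10), (False, True): F(1, 2),
             (True, False): F(3, 5), (True, True): F(9, 10)}
p_s_given = {False: F(1, 5), True: F(4, 5)}


def bern(p_true, value):
    """Probability of a true/false value when p_true is the chance of True."""
    return p_true if value else 1 - p_true


def joint(c1, c2, d, s):
    """Product over all nodes of p(node | immediate predecessors)."""
    return (bern(p_c1, c1) * bern(p_c2, c2)
            * bern(p_d_given[(c1, c2)], d) * bern(p_s_given[d], s))


def marginal_d(d):
    """Sum the joint over every value of the other variables (exponential)."""
    return sum(joint(c1, c2, d, s)
               for c1, c2, s in product([False, True], repeat=3))


def map_disease(c1, c2, s):
    """Most likely disease value given the causes and the symptom."""
    return max([False, True], key=lambda d: joint(c1, c2, d, s))


print(marginal_d(True), map_disease(True, False, True))
```

With $k$ unobserved true/false variables the sum in `marginal_d` has $2^k$ terms, which is the cost that belief propagation avoids when the graph structure permits.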


9.13 Markov Random Fields 


The Markov random field model arose first in statistical mechanics where it was called
the Ising model. It is instructive to start with a description of it. The simplest version
of the Ising model consists of $n$ particles arranged in a rectangular $\sqrt{n} \times \sqrt{n}$ grid. Each
particle can have a spin that is denoted $\pm 1$. The energy of the whole system depends
on interactions between pairs of neighboring particles. Let $x_i$ be the spin, $\pm 1$, of the $i$th
particle. Denote by $i \sim j$ the relation that $i$ and $j$ are adjacent in the grid. In the Ising
model, the energy of the system is given by


$$f(x_1, x_2, \ldots, x_n) = \exp\Bigl(c \sum_{i \sim j} |x_i - x_j|\Bigr).$$


The constant $c$ can be positive or negative. If $c < 0$, then energy is lower if many adjacent
pairs have opposite spins and if $c > 0$ the reverse holds. The model was first used to
model probabilities of spin configurations in physical materials.
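As a quick check of the sign convention, the energy can be evaluated directly on a small grid. This is only a sketch: the row-by-row numbering of the particles and the helper names `grid_edges` and `ising_energy` are assumptions made for the example.

```python
import math


def grid_edges(side):
    """Pairs (i, j) of adjacent particles in a side x side grid, numbered by rows."""
    edges = []
    for r in range(side):
        for c in range(side):
            i = r * side + c
            if c + 1 < side:
                edges.append((i, i + 1))      # right neighbor
            if r + 1 < side:
                edges.append((i, i + side))   # neighbor below
    return edges


def ising_energy(spins, side, c):
    """exp(c * sum over adjacent pairs of |x_i - x_j|) for spins in {+1, -1}."""
    total = sum(abs(spins[i] - spins[j]) for i, j in grid_edges(side))
    return math.exp(c * total)


aligned = [1, 1, 1, 1]
checkerboard = [1, -1, -1, 1]
print(ising_energy(aligned, 2, -0.5), ising_energy(checkerboard, 2, -0.5))
```

In the 2 x 2 grid every adjacent pair of the checkerboard differs by 2, so the sum is 8 and the energy is $e^{8c}$, which is below the aligned value of 1 exactly when $c < 0$.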


In most computer science settings, such functions are mainly used as objective
functions that are to be optimized subject to some constraints. The problem is to find the
minimum energy set of spins under some constraints on the spins. Usually the constraints
just specify the spins of some particles. Note that when $c > 0$, this is the problem of
minimizing $\sum_{i \sim j} |x_i - x_j|$ subject to the constraints. The objective function is convex and
so this can be done efficiently. If $c < 0$, however, we need to minimize a concave function
for which there is no known efficient algorithm. The minimization of a concave function
in general is NP-hard. Intuitively, this is because the set of inputs for which $f(x)$ is less
than some given value can be non-convex or even consist of many disconnected regions.


A second important motivation comes from the area of vision. It has to do with
reconstructing images. Suppose we are given noisy observations of the intensity of light at
individual pixels, $x_1, x_2, \ldots, x_n$, and wish to compute the true values, the true intensities,
of these variables $y_1, y_2, \ldots, y_n$. There may be two sets of constraints, the first stipulating
that the $y_i$ should generally be close to the corresponding $x_i$ and the second, a term
correcting possible observation errors, stipulating that $y_i$ should generally be close to the
values of $y_j$ for $j \sim i$. This can be formulated as


$$\min_{y} \Bigl( \sum_i |x_i - y_i| + \sum_{i \sim j} |y_i - y_j| \Bigr)$$


where the values of $x_i$ are constrained to be the observed values. The objective function
is convex and polynomial time minimization algorithms exist. Other objective functions
using say sum of squares instead of sum of absolute values can be used and there are
polynomial time algorithms as long as the function to be minimized is convex.


Figure 9.3: The factor graph for the function
$f(x_1, x_2, x_3) = (x_1 + x_2 + x_3)(x_1 + x_2)(x_1 + x_3)(x_2 + x_3)$.


More generally, the correction term may depend on all grid points within distance
two of each point rather than just immediate neighbors. Even more generally, we may
have $n$ variables $y_1, y_2, \ldots, y_n$ with the value of some of them already specified and subsets
$S_1, S_2, \ldots, S_m$ of these variables constrained in some way. The constraints are accumulated
into one objective function which is a product of functions $f_1, f_2, \ldots, f_m$, where function $f_i$
is evaluated on the variables in subset $S_i$. The problem is to minimize $\prod_{i=1}^{m} f_i(y_j, j \in S_i)$
subject to constrained values. Note that the vision example had a sum instead of a
product, but by taking exponentials we can turn the sum into a product as in the Ising model.


In general, the $f_i$ are not convex; indeed they may be discrete. So the minimization
cannot be carried out by a known polynomial time algorithm. The most used forms of the
Markov random field involve $S_i$ which are cliques of a graph. So we make the following
definition.


A Markov Random Field consists of an undirected graph and an associated function 
that factorizes into functions associated with the cliques of the graph. The special case 
when all the factors correspond to cliques of size one or two is of interest. 


9.14 Factor Graphs 


Factor graphs arise when we have a function $f$ of variables $x = (x_1, x_2, \ldots, x_n)$ that
can be expressed as $f(x) = \prod_{\alpha} f_\alpha(x_\alpha)$ where each factor depends only on some small
number of variables $x_\alpha$. The difference from Markov random fields is that the variables
corresponding to factors do not necessarily form a clique. Associate a bipartite graph
where one set of vertices corresponds to the factors and the other set to the variables.
Place an edge between a variable and a factor if the factor contains that variable. See
Figure 9.3.
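The bipartite structure is easy to write down explicitly. The sketch below encodes the function of Figure 9.3; the factor names `f1` to `f4` and the dictionary layout are illustrative assumptions, not notation from the text.

```python
# Each factor: (variables it depends on, function of those variables)
factors = {
    "f1": (("x1", "x2", "x3"), lambda x1, x2, x3: x1 + x2 + x3),
    "f2": (("x1", "x2"), lambda x1, x2: x1 + x2),
    "f3": (("x1", "x3"), lambda x1, x3: x1 + x3),
    "f4": (("x2", "x3"), lambda x2, x3: x2 + x3),
}

# Bipartite graph: an edge joins a factor to every variable it contains
edges = [(name, var) for name, (scope, _) in factors.items() for var in scope]
variables = sorted({var for _, var in edges})


def evaluate(assignment):
    """Value of f: the product of all factors at the given assignment."""
    result = 1
    for scope, fn in factors.values():
        result *= fn(*(assignment[var] for var in scope))
    return result


print(edges)
print(evaluate({"x1": 1, "x2": 1, "x3": 1}))
```

This graph has 7 vertices and 9 edges, so it contains cycles and is not a tree; the tree algorithms of the next section do not apply to it directly.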


9.15 Tree Algorithms 


Let f(x) be a function that is a product of factors. When the factor graph is a tree 
there are efficient algorithms for solving certain problems. With slight modifications, the 
algorithms presented can also solve problems where the function is the sum of terms rather 
than a product of factors. 


The first problem is called marginalization and involves evaluating the sum of f over 
all variables except one. In the case where f is a probability distribution the algorithm 
computes the marginal probabilities and thus the word marginalization. The second prob- 
lem involves computing the assignment to the variables that maximizes the function f. 
When f is a probability distribution, this problem is the maximum a posteriori probabil- 
ity or MAP problem. 


If the factor graph is a tree (such as in Figure 9.4), then there exists an efficient al- 
gorithm for solving these problems. Note that there are four problems: the function f 
is either a product or a sum and we are either marginalizing or finding the maximizing 
assignment to the variables. All four problems are solved by essentially the same algo- 
rithm and we present the algorithm for the marginalization problem when f is a product. 
Assume we want to “sum out” all the variables except $x_1$, leaving a function of $x_1$.


Call the variable node associated with some variable $x_i$ node $x_i$. First, make the node
$x_1$ the root of the tree. It will be useful to think of the algorithm first as a recursive
algorithm and then unravel the recursion. We want to compute the product of all factors
occurring in the sub-tree rooted at the root with all variables except the root-variable
summed out. Let $g_i$ be the product of all factors occurring in the sub-tree rooted at
node $x_i$ with all variables occurring in the subtree except $x_i$ summed out. Since this is a
tree, $x_1$ will not reoccur anywhere except the root. Now, the grandchildren of the root
are variable nodes and suppose inductively that each grandchild $x_i$ of the root has already
computed its $g_i$. It is easy to see that we can compute $g_1$ as follows.


Each grandchild $x_i$ of the root passes its $g_i$ to its parent, which is a factor node. Each
child of $x_1$ collects all its children’s $g_i$, multiplies them together with its own factor and
sends the product to the root. The root multiplies all the products it gets from its children
and sums out all variables except its own variable, namely here $x_1$.


Unraveling the recursion is also simple, with the convention that a leaf node just
receives 1, the product of an empty set of factors, from its children. Each node waits until it
receives a message from each of its children. After that, if the node is a variable node,
it computes the product of all incoming messages, and sums this product function over
all assignments to the variables except for the variable of the node. Then, it sends the
resulting function of one variable out along the edge to its parent. If the node is a factor
node, it computes the product of its factor function along with incoming messages from
all the children and sends the resulting function out along the edge to its parent.


Figure 9.4: The factor graph for the function $f = x_1 (x_1 + x_2 + x_3)(x_3 + x_4 + x_5)\, x_4 x_5$.


The reader should prove that the following invariant holds assuming the graph is a tree: 


Invariant The message passed by each variable node to its parent is the product of 
all factors in the subtree under the node with all variables in the subtree except its own 
summed out. 


Consider the following example where
$$f = x_1 (x_1 + x_2 + x_3)(x_3 + x_4 + x_5)\, x_4 x_5$$
and the variables take on values 0 or 1. Consider marginalizing $f$ by computing


$$f(x_1) = \sum_{x_2 x_3 x_4 x_5} x_1 (x_1 + x_2 + x_3)(x_3 + x_4 + x_5)\, x_4 x_5.$$


In this case the factor graph is a tree as shown in Figure 9.4. The factor graph as a 
rooted tree and the messages passed by each node to its parent are shown in Figure 9.5. 
If instead of computing marginals, one wanted the variable assignment that maximizes 
the function f, one would modify the above procedure by replacing the summation by a 
maximization operation. Obvious modifications handle the situation where f(x) is a sum 
of products. 

$$f(x) = \sum_{x_1, \ldots, x_n} g(x)$$
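The recursive view of the tree algorithm can be sketched for the example above. The factor names `a` to `e` are made up, and one detail differs from the description in the text: here each factor node sums out its child variables before sending its message, rather than leaving the summation to the variable node above it. The function $g_i$ that each variable node ends up with is the same either way.

```python
from itertools import product

DOMAIN = (0, 1)
# f = x1 (x1 + x2 + x3) (x3 + x4 + x5) x4 x5, one entry per factor
factors = {
    "a": (("x1",), lambda x1: x1),
    "b": (("x1", "x2", "x3"), lambda x1, x2, x3: x1 + x2 + x3),
    "c": (("x3", "x4", "x5"), lambda x3, x4, x5: x3 + x4 + x5),
    "d": (("x4",), lambda x4: x4),
    "e": (("x5",), lambda x5: x5),
}


def variable_message(var, parent_factor):
    """g_var: product of all factors below var, other variables summed out."""
    message = {value: 1 for value in DOMAIN}  # a leaf just sends 1
    for name, (scope, _) in factors.items():
        if name != parent_factor and var in scope:
            child = factor_message(name, var)
            for value in DOMAIN:
                message[value] *= child[value]
    return message


def factor_message(name, parent_var):
    """Factor times its children's messages, with the children summed out."""
    scope, fn = factors[name]
    children = [v for v in scope if v != parent_var]
    incoming = {v: variable_message(v, name) for v in children}
    message = {}
    for value in DOMAIN:
        total = 0
        for combo in product(DOMAIN, repeat=len(children)):
            assignment = dict(zip(children, combo))
            assignment[parent_var] = value
            term = fn(*(assignment[v] for v in scope))
            for v in children:
                term *= incoming[v][assignment[v]]
            total += term
        message[value] = total
    return message


marginal_x1 = variable_message("x1", None)  # x1 is the root
print(marginal_x1)
```

Replacing the inner `total += term` by a running maximum would turn the same recursion into the maximization variant mentioned above.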


9.16 Message Passing in General Graphs 


The simple message passing algorithm in the last section gives us the one variable
function of $x_1$ when we sum out all the other variables. For a general graph that is not
a tree, we formulate an extension of that algorithm. But unlike the case of trees, there
is no proof that the algorithm will converge and even if it does, there is no guarantee
that the limit is the marginal probability. This has not prevented its usefulness in some
applications.


Figure 9.5: Messages.


First, let's ask a more general question, just for trees. Suppose we want to compute
for each $i$ the one-variable function of $x_i$ when we sum out all variables $x_j$, $j \neq i$. Do we
have to repeat what we did for $x_1$ once for each $x_i$? Luckily, the answer is no. It will
suffice to do a second pass from the root to the leaves of essentially the same message
passing algorithm to get all the answers. Recall that in the first pass, each edge of the
tree has sent a message “up”, from the child to the parent. In the second pass, each edge
will send a message down from the parent to the child. We start with the root and work
downwards for this pass. Each node waits until its parent has sent it a message before
sending messages to each of its children. The rules for messages are:


Rule 1 The message from a factor node $v$ to a child $x_i$, which is a variable node,
is the product of all messages received by $v$ in both passes from all nodes other than $x_i$
times the factor at $v$ itself.


Rule 2 The message from a variable node $x_i$ to a factor node child, $v$, is the product
of all messages received by $x_i$ in both passes from all nodes except $v$, with all variables
except $x_i$ summed out. The message is a function of $x_i$ alone.


At termination, when the graph is a tree, if we take the product of all messages
received in both passes by a variable node $x_i$ and sum out all variables except $x_i$ in this
product, what we get is precisely the entire function marginalized to $x_i$. We do not give
the proof here. But the idea is simple. We know from the first pass that the product of
the messages coming to a variable node $x_i$ from its children is the product of all factors in
the sub-tree rooted at $x_i$. In the second pass, we claim that the message from the parent
$v$ to $x_i$ is the product of all factors which are not in the sub-tree rooted at $x_i$, which one
can show either directly or by induction working from the root downwards.


We can apply the same rules 1 and 2 to any general graph. We do not have child and 
parent relationships and it is not possible to have the two synchronous passes as before. 
The messages keep flowing and one hopes that after some time, the messages will stabilize, 
but nothing like that is proven. We state the algorithm for general graphs now: 


Rule 1 At each unit of time, each factor node $v$ sends a message to each adjacent
node $x_i$. The message is the product of all messages received by $v$ at the previous step
except for the one from $x_i$ multiplied by the factor at $v$ itself.


Rule 2 At each time, each variable node $x_i$ sends a message to each adjacent node $v$.
The message is the product of all messages received by $x_i$ at the previous step except the
one from $v$, with all variables except $x_i$ summed out.
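A synchronous version of these rules can be sketched as follows. Several things here are assumptions for illustration: all messages start at 1, variables take values 0 or 1, and the summation is done at the factor node (the common sum-product form) instead of at the variable node as in the rules above. The example runs the rules on the tree-shaped factor graph of Figure 9.4, where the messages stop changing after a few rounds and the product of the messages arriving at each variable is its marginal.

```python
from itertools import product

DOMAIN = (0, 1)
# f = x1 (x1 + x2 + x3) (x3 + x4 + x5) x4 x5; factor names are illustrative
factors = {
    "a": (("x1",), lambda x1: x1),
    "b": (("x1", "x2", "x3"), lambda x1, x2, x3: x1 + x2 + x3),
    "c": (("x3", "x4", "x5"), lambda x3, x4, x5: x3 + x4 + x5),
    "d": (("x4",), lambda x4: x4),
    "e": (("x5",), lambda x5: x5),
}
edges = [(f, v) for f, (scope, _) in factors.items() for v in scope]


def run_message_passing(rounds):
    to_factor = {(v, f): {x: 1 for x in DOMAIN} for f, v in edges}
    to_var = {(f, v): {x: 1 for x in DOMAIN} for f, v in edges}
    for _ in range(rounds):
        # Factor to variable: factor times the other incoming messages
        new_to_var = {}
        for f, v in edges:
            scope, fn = factors[f]
            others = [u for u in scope if u != v]
            msg = {}
            for x in DOMAIN:
                total = 0
                for combo in product(DOMAIN, repeat=len(others)):
                    values = dict(zip(others, combo))
                    values[v] = x
                    term = fn(*(values[u] for u in scope))
                    for u in others:
                        term *= to_factor[(u, f)][values[u]]
                    total += term
                msg[x] = total
            new_to_var[(f, v)] = msg
        # Variable to factor: product of messages from the other factors
        new_to_factor = {}
        for f, v in edges:
            msg = {x: 1 for x in DOMAIN}
            for g, u in edges:
                if u == v and g != f:
                    for x in DOMAIN:
                        msg[x] *= to_var[(g, v)][x]
            new_to_factor[(v, f)] = msg
        to_var, to_factor = new_to_var, new_to_factor
    beliefs = {}
    for v in sorted({v for _, v in edges}):
        belief = {x: 1 for x in DOMAIN}
        for g, u in edges:
            if u == v:
                for x in DOMAIN:
                    belief[x] *= to_var[(g, v)][x]
        beliefs[v] = belief
    return beliefs


beliefs = run_message_passing(rounds=10)
print(beliefs)
```

On a graph with cycles the same loop can be run unchanged, but, as noted above, the messages need not settle and the resulting products need not be the true marginals.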


9.16.1 Graphs with a Single Cycle 


The message passing algorithm gives the correct answers on trees and on certain other 
graphs. One such situation is graphs with a single cycle which we treat here. We switch 
from the marginalization problem to the MAP problem as the proof of correctness is 
simpler for the MAP problem. Consider the network in Figure 9.6a with a single cycle. 
The message passing scheme will count some evidence multiply. The local evidence at A 
will get passed around the loop and will come back to A. Thus, A will count the local 
evidence multiple times. If all evidence is multiply counted in equal amounts, then there 
is a possibility that though the numerical values of the marginal probabilities (beliefs) are 
wrong, the algorithm still converges to the correct maximum a posteriori assignment. 


Consider the unwrapped version of the graph in Figure 9.6b. The messages that the
loopy version will eventually converge to, assuming convergence, are the same messages
that occur in the unwrapped version provided that the nodes are sufficiently far in from
the ends. The beliefs in the unwrapped version are correct for the unwrapped graph since
it is a tree. The only question is how similar they are to the true beliefs in the original
network.


Figure 9.6: Unwrapping a graph with a single cycle. (a) A graph with a single cycle. (b) Segment of unrolled graph.


Write $p(A, B, C) = e^{\log p(A,B,C)} = e^{J(A,B,C)}$ where $J(A, B, C) = \log p(A, B, C)$. Then
the probability for the unwrapped network is of the form $e^{kJ(A,B,C) + J'}$ where the $J'$ is
associated with vertices at the ends of the network where the beliefs have not yet
stabilized and the $kJ(A, B, C)$ comes from $k$ inner copies of the cycle where the beliefs have
stabilized. Note that the last copy of $J$ in the unwrapped network shares an edge with $J'$
and that edge has an associated $\Psi$. Thus, changing a variable in $J$ has an impact on the
value of $J'$ through that $\Psi$. Since the algorithm maximizes $J_k = kJ(A, B, C) + J'$ in the
unwrapped network for all $k$, it must maximize $J(A, B, C)$. To see this, set the variables
$A$, $B$, $C$ so that $J_k$ is maximized. If $J(A, B, C)$ is not maximized, then change $A$, $B$, and
$C$ to maximize $J(A, B, C)$. This increases $J_k$ by some quantity that is proportional to
$k$. However, two of the variables that appear in copies of $J(A, B, C)$ also appear in $J'$
and thus $J'$ might decrease in value. As long as $J'$ decreases by some finite amount, we
can increase $J_k$ by increasing $k$ sufficiently. As long as all $\Psi$'s are nonzero, $J'$, which is
proportional to $\log \Psi$, can change by at most some finite amount. Hence, for a network
with a single loop, assuming that the message passing algorithm converges, it converges
to the maximum a posteriori assignment.


Figure 9.7: A Markov random field with a single loop.


9.16.2 Belief Update in Networks with a Single Loop 


In the previous section, we showed that when the message passing algorithm converges,
it correctly solves the MAP problem for graphs with a single loop. The message passing
algorithm can also be used to obtain the correct answer for the marginalization problem.
Consider a network consisting of a single loop with variables $x_1, x_2, \ldots, x_n$ and evidence
$y_1, y_2, \ldots, y_n$ as shown in Figure 9.7. The $x_i$ and $y_i$ can be represented by vectors having
a component for each value $x_i$ can take on. To simplify the discussion assume the $x_i$ take
on values $1, 2, \ldots, m$.


Let $m_i$ be the message sent from vertex $i$ to vertex $i + 1 \bmod n$. At vertex $i + 1$
each component of the message $m_i$ is multiplied by the evidence $y_{i+1}$ and the constraint
function $\Psi$. This is done by forming a diagonal matrix $D_{i+1}$ where the diagonal elements
are the evidence and then forming a matrix $M_i$ whose $jk$th element is $\Psi(x_{i+1} = j, x_i = k)$.
The message $m_{i+1}$ is $M_i D_{i+1} m_i$. Multiplication by the diagonal matrix $D_{i+1}$ multiplies
the components of the message $m_i$ by the associated evidence. Multiplication by the
matrix $M_i$ multiplies each component of the vector by the appropriate value of $\Psi$ and
sums over the values producing the vector which is the message $m_{i+1}$. Once the message
has travelled around the loop, the new message $m_1'$ is given by
$$m_1' = M_n D_1 M_{n-1} D_n \cdots M_2 D_3 M_1 D_2\, m_1.$$


Let $M = M_n D_1 M_{n-1} D_n \cdots M_2 D_3 M_1 D_2$. Assuming that $M$'s principal eigenvalue is
unique, the message passing will converge to the principal eigenvector of $M$. The rate of
convergence depends on the ratio of the first and second eigenvalues.


An argument analogous to the above concerning the messages going clockwise around
the loop applies to messages moving counter-clockwise around the loop. To obtain the
estimate of the marginal probability $p(x_1)$, one multiplies component-wise the two messages
arriving at $x_1$ along with the evidence $y_1$. This estimate does not give the true marginal
probability but the true marginal probability can be computed from the estimate and the
rate of convergence by linear algebra.
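The convergence claim can be observed numerically. In the sketch below all numbers are invented: a loop of three vertices with two values each, a single potential matrix `psi` used for every $M_i$, and arbitrary positive evidence vectors. The code forms $M$ as the product of the $M_i D_{i+1}$ in the order the message meets them and then repeatedly sends a normalized message around the loop, which is power iteration on $M$.

```python
def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    size = len(A)
    return [[sum(A[r][k] * B[k][c] for k in range(size)) for c in range(size)]
            for r in range(size)]


def matvec(A, v):
    return [sum(A[r][k] * v[k] for k in range(len(v))) for r in range(len(A))]


def diag(v):
    return [[v[r] if r == c else 0.0 for c in range(len(v))] for r in range(len(v))]


# Hypothetical data: psi[j][k] plays the role of Psi(x_{i+1} = j, x_i = k)
psi = [[2.0, 1.0], [1.0, 3.0]]
evidence = [[0.6, 0.4], [0.5, 0.5], [0.3, 0.7]]  # y_1, y_2, y_3

n = len(evidence)
M = [[1.0, 0.0], [0.0, 1.0]]
for i in range(n):
    # Message i -> i+1 is multiplied by D_{i+1} and then by M_i
    step = matmul(psi, diag(evidence[(i + 1) % n]))
    M = matmul(step, M)  # later steps multiply on the left

message = [1.0, 1.0]
for _ in range(200):  # each pass is one trip around the loop
    message = matvec(M, message)
    total = sum(message)
    message = [value / total for value in message]

print(message)
```

Because every entry of `M` is positive here, the largest eigenvalue is unique and the normalized message settles on the corresponding eigenvector; normalizing each trip only prevents the numbers from growing without bound.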


9.16.3 Maximum Weight Matching 


We have seen that the belief propagation algorithm converges to the correct solution 
in trees and graphs with a single cycle. It also correctly converges for a number of prob- 
lems. Here we give one example, the maximum weight matching problem where there is 
a unique solution. 


We apply the belief propagation algorithm to find the maximum weight matching
(MWM) in a complete bipartite graph. If the MWM in the bipartite graph is unique,
then the belief propagation algorithm will converge to it.


Let $G = (V_1, V_2, E)$ be a complete bipartite graph where $V_1 = \{a_1, \ldots, a_n\}$, $V_2 =
\{b_1, \ldots, b_n\}$, and $(a_i, b_j) \in E$, $1 \le i, j \le n$. Let $\pi = \{\pi(1), \ldots, \pi(n)\}$ be a
permutation of $\{1, \ldots, n\}$. The collection of edges $\{(a_1, b_{\pi(1)}), \ldots, (a_n, b_{\pi(n)})\}$ is called a
matching which is denoted by $\pi$. Let $w_{ij}$ be the weight associated with the edge $(a_i, b_j)$.
The weight of the matching $\pi$ is $w_\pi = \sum_{i=1}^{n} w_{i\pi(i)}$. The maximum weight matching $\pi^*$ is
$$\pi^* = \arg\max_{\pi} w_\pi.$$

The first step is to create a factor graph corresponding to the MWM problem. Each
edge of the bipartite graph is represented by a variable $c_{ij}$ which takes on the value zero or
one. The value one means that the edge is present in the matching, the value zero means
that the edge is not present in the matching. A set of constraints is used to force the set
of edges to be a matching. The constraints are of the form $\sum_j c_{ij} = 1$ and $\sum_i c_{ij} = 1$. Any
0,1 assignment to the variables $c_{ij}$ that satisfies all of the constraints defines a matching.
In addition, we have constraints for the weights of the edges.


We now construct a factor graph, a portion of which is shown in Figure 9.8. Associated
with the factor graph is a function $f(c_{11}, c_{12}, \ldots)$ consisting of a set of terms for each $c_{ij}$
enforcing the constraints and summing the weights of the edges of the matching. The
terms for $c_{12}$ are


$$-\lambda \Bigl| \Bigl( \sum_i c_{i2} \Bigr) - 1 \Bigr| - \lambda \Bigl| \Bigl( \sum_j c_{1j} \Bigr) - 1 \Bigr| + w_{12} c_{12}$$


where $\lambda$ is a large positive number used to enforce the constraints when we maximize the
function. Finding the values of $c_{11}, c_{12}, \ldots$ that maximize $f$ finds the maximum weighted
matching for the bipartite graph.














If the factor graph were a tree, then the message from a variable node $x$ to its parent
would be a message $g(x)$ that gives the maximum value for the subtree for each value of $x$. To
compute $g(x)$, one sums all messages into the node $x$. For a constraint node, one sums
all messages from subtrees and maximizes the sum over all variables except the variable
of the parent node subject to the constraint. The message from a variable $x$ consists of
two pieces of information, the value $p(x = 0)$ and the value $p(x = 1)$. This information
can be encoded into a linear function of $x$:


$$[p(x = 1) - p(x = 0)]\, x + p(x = 0).$$


Thus, the messages are of the form $ax + b$. To determine the MAP value of $x$ once the
algorithm converges, sum all messages into $x$ and take the maximum over $x = 1$ and $x = 0$
to determine the value for $x$. Since the arg maximum of a linear form $ax + b$ depends
only on whether $a$ is positive or negative and since maximizing the output of a constraint
depends only on the coefficient of the variable, we can send messages consisting of just
the variable coefficient.


To calculate the message to $c_{12}$ from the constraint that node $b_2$ has exactly one
neighbor, add all the messages that flow into the constraint node from the $c_{i2}$, $i \neq 1$,
nodes and maximize subject to the constraint that exactly one variable has value one. If
$c_{12} = 0$, then one of $c_{i2}$, $i \neq 1$, will have value one and the message is $\max_{i \neq 1} \alpha(i, 2)$. If
$c_{12} = 1$, then the message is zero. Thus, we get
$$-\max_{i \neq 1} \alpha(i, 2)\, x + \max_{i \neq 1} \alpha(i, 2)$$
as a linear function of $x = c_{12}$ and send the coefficient $-\max_{i \neq 1} \alpha(i, 2)$. This means that
the message from $c_{12}$ to the other constraint node is $\beta(1, 2) = w_{12} - \max_{i \neq 1} \alpha(i, 2)$.


The alpha message is calculated in a similar fashion. If $c_{12} = 0$, then one of $c_{1j}$, $j \neq 2$, will
have value one and the message is $\max_{j \neq 2} \beta(1, j)$. If $c_{12} = 1$, then the message is zero. Thus,
the coefficient $-\max_{j \neq 2} \beta(1, j)$ is sent. This means that $\alpha(1, 2) = w_{12} - \max_{j \neq 2} \beta(1, j)$.
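The two update rules can be iterated directly. The following sketch makes several assumptions that are not stated in the text: all messages start at the edge weights, the updates are applied synchronously, a fixed number of rounds is run, and at the end each $a_i$ picks the $b_j$ with the largest incoming $\beta(i, j)$. Indices start at 0 in the code, the weight matrix is invented, and a brute-force search over permutations is included only for comparison.

```python
from itertools import permutations


def bp_matching(w, iterations=100):
    """Iterate alpha(i,j) = w_ij - max_{l != j} beta(i,l) and
    beta(i,j) = w_ij - max_{k != i} alpha(k,j); return the chosen j for each i."""
    n = len(w)
    alpha = [row[:] for row in w]  # assumed initialization
    beta = [row[:] for row in w]
    for _ in range(iterations):
        new_beta = [[w[i][j] - max(alpha[k][j] for k in range(n) if k != i)
                     for j in range(n)] for i in range(n)]
        new_alpha = [[w[i][j] - max(beta[i][l] for l in range(n) if l != j)
                      for j in range(n)] for i in range(n)]
        alpha, beta = new_alpha, new_beta
    # a_i selects the neighbor whose incoming message is largest
    return [max(range(n), key=lambda j: beta[i][j]) for i in range(n)]


def brute_force_matching(w):
    """Exhaustive search over all n! matchings, for checking small cases."""
    n = len(w)
    best = max(permutations(range(n)),
               key=lambda p: sum(w[i][p[i]] for i in range(n)))
    return list(best)


# Hypothetical weights with a unique maximum weight matching
w = [[4, 3, 1],
     [3, 1, 1],
     [1, 1, 2]]
print(bp_matching(w), brute_force_matching(w))
```

In this instance the heaviest single edge, of weight 4, is not part of the maximum weight matching, so a greedy choice would fail while the message updates settle on the matching of weight 8.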
Al j 


To prove convergence, we unroll the constraint graph to form a tree with a constraint
node as the root. In the unrolled graph a variable node such as $c_{12}$ will appear a number
of times which depends on how deep a tree is built. Each occurrence of a variable such
as $c_{12}$ is deemed to be a distinct variable.


Figure 9.8: Portion of factor graph for the maximum weight matching problem.


Lemma 9.14 If the tree obtained by unrolling the graph is of depth k, then the messages
to the root are the same as the messages in the constraint graph after k iterations.


Proof: Straightforward. ■


Define a matching in the tree to be a set of vertices so that there is exactly one variable
node of the match adjacent to each constraint. Let $\Lambda$ denote the vertices of the matching.
Heavy circles in Figure 9.9 represent the nodes of the above tree that are in the matching $\Lambda$.


Let $\Pi$ be the vertices corresponding to maximum weight matching edges in the bipartite
graph. Recall that vertices in the above tree correspond to edges in the bipartite
graph. The vertices of $\Pi$ are denoted by dotted circles in the above tree.


Consider a set of trees where each tree has a root that corresponds to one of the constraints.
If the constraint at each root is satisfied by the edge of the MWM, then we have
found the MWM. Suppose that the matching at the root in one of the trees disagrees
with the MWM. Then there is an alternating path of vertices of length 2k consisting of
vertices corresponding to edges in $\Pi$ and edges in $\Lambda$. Map this path onto the bipartite
graph. In the bipartite graph the path will consist of a number of cycles plus a simple
path. If k is large enough there will be a large number of cycles since no cycle can be of
length more than 2n. Let m be the number of cycles. Then $m \geq \frac{2k}{2n} = \frac{k}{n}$.


Let $\pi^*$ be the MWM in the bipartite graph. Take one of the cycles and use it as an
alternating path to convert the MWM to another matching. Assuming that the MWM
is unique and that the next closest matching is $\varepsilon$ less, $W_{\pi^*} - W_{\pi} > \varepsilon$ where $\pi$ is the new
matching.


Figure 9.9: Tree for MWM problem.


Consider the tree matching. Modify the tree matching by using the alternating path
of all cycles and the left over simple path. The simple path is converted to a cycle by
adding two edges. The cost of the two edges is at most $2w^*$ where $w^*$ is the weight of the
maximum weight edge. Each time we modify $\Lambda$ by an alternating cycle, we increase the
cost of the matching by at least $\varepsilon$. When we modify $\Lambda$ by the left over simple path, we
increase the cost of the tree matching by $\varepsilon - 2w^*$ since the two edges that were used to
create a cycle in the bipartite graph are not used. Thus

\[ \text{weight of } \Lambda - \text{weight of } \Lambda' \geq \frac{k}{n}\,\varepsilon - 2w^* \]

which must be negative since $\Lambda'$ is optimal for the tree. However, if k is large enough this
becomes positive, an impossibility since $\Lambda'$ is the best possible. Since we have a tree, there
can be no cycles. As messages are passed up the tree, each subtree is optimal and hence the
total tree is optimal. Thus the message passing algorithm must find the maximum weight
matching in the weighted complete bipartite graph assuming that the maximum weight
matching is unique. Note that applying one of the cycles that makes up the alternating
path decreases the bipartite graph match but increases the value of the tree. However,
it does not give a higher tree matching, which is not possible since we already have the
maximum tree matching. The reason for this is that the application of a single cycle does
not result in a valid tree matching. One must apply the entire alternating path to go from
one matching to another.








































Figure 9.10: warning propagation 


9.17 Warning Propagation 


Significant progress has been made using methods similar to belief propagation in
finding satisfying assignments for 3-CNF formulas. Thus, we include a section on a
version of belief propagation, called warning propagation, that is quite effective in finding
assignments. Consider a factor graph for a SAT problem (Figure 9.10). Index the variables
by i, j, and k and the factors by a, b, and c. Factor a sends a message $m_{ai}$, called a
warning, to each variable i that appears in the factor a. The warning is 0 or 1 depending on whether
or not factor a believes that the value assigned to i is required for a to be satisfied. A
factor a determines the warning to send to variable i by examining all warnings received
by other variables in factor a from factors containing them.


For each variable j in a other than i, sum the warnings from factors containing j that warn j to take
value T and subtract the warnings that warn j to take value F. If the difference says that
j should take value T or F and this value for variable j does not satisfy a, and this is
true for all j, then a sends a warning to i that the value of variable i is critical for factor a.


Start the warning propagation algorithm by assigning 1 to a warning with probability
1/2. Iteratively update the warnings. If the warning propagation algorithm converges,
then compute for each variable i the local field $h_i$ and the contradiction number $c_i$. The
local field $h_i$ is the number of clauses containing the variable i that sent messages that
i should take value T minus the number that sent messages that i should take value F.
The contradiction number $c_i$ is 1 if variable i gets conflicting warnings and 0 otherwise.
If the factor graph is a tree, the warning propagation algorithm converges. If one of the
contradiction numbers is one, the problem is unsatisfiable; otherwise it is satisfiable.
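

The update rule above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not a reference implementation: the clause encoding (the integer k for $x_k$, -k for its negation), the function name `warning_propagation`, the in-place update order, and the `max_iters` cutoff are all choices made here for the example.

```python
import random

def warning_propagation(clauses, max_iters=100, seed=0):
    """Run warning propagation on a CNF formula.

    clauses: list of clauses, each a list of nonzero ints; k means x_k and
    -k means NOT x_k (an assumed encoding; a variable appears at most once
    per clause). Returns (warnings, local_field, contradiction) if the
    warnings stop changing, or None if they still change after max_iters.
    """
    rng = random.Random(seed)
    # sign[(a, i)] is +1 if x_i = T satisfies clause a, -1 if x_i = F does.
    sign = {(a, abs(lit)): (1 if lit > 0 else -1)
            for a, clause in enumerate(clauses) for lit in clause}
    # Initialise each warning to 1 with probability 1/2.
    u = {edge: rng.randint(0, 1) for edge in sign}
    occurs = {}
    for a, i in sign:
        occurs.setdefault(i, []).append(a)

    def cavity_field(j, a):
        # Warnings to j from factors other than a: (# saying T) - (# saying F).
        return sum(sign[(b, j)] * u[(b, j)] for b in occurs[j] if b != a)

    for _ in range(max_iters):
        changed = False
        for a, i in sign:
            # a warns i only if every other variable j in a is being pushed
            # to the value that does not satisfy a.
            new = int(all(sign[(a, abs(lit))] * cavity_field(abs(lit), a) < 0
                          for lit in clauses[a] if abs(lit) != i))
            if new != u[(a, i)]:
                u[(a, i)] = new
                changed = True
        if not changed:
            break
    else:
        return None

    # Local field: warnings for T minus warnings for F.
    h = {i: sum(sign[(a, i)] * u[(a, i)] for a in fs) for i, fs in occurs.items()}
    # Contradiction number: 1 if i is warned towards both T and F.
    c = {i: int(any(u[(a, i)] and sign[(a, i)] > 0 for a in fs) and
                any(u[(a, i)] and sign[(a, i)] < 0 for a in fs))
         for i, fs in occurs.items()}
    return u, h, c
```

For the tree-structured formula $x_1 \wedge (\bar{x}_1 \vee x_2)$, encoded as `[[1], [-1, 2]]`, both local fields come out positive and no contradiction is reported. For $x_1 \wedge \bar{x}_1$, encoded as `[[1], [-1]]`, the contradiction number of $x_1$ is 1.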


9.18 Correlation Between Variables 


In many situations one is interested in how the correlation between variables drops off
with some measure of distance. Consider a factor graph for a 3-CNF formula. Measure
the distance between two variables by the shortest path in the factor graph. One might
ask if one variable is assigned the value true, what is the percentage of satisfying assignments
of the 3-CNF formula in which the second variable also is true. If the percentage
is the same as when the first variable is assigned false, then we say that the two variables
are uncorrelated. How difficult it is to solve a problem is likely to be related to how fast
the correlation decreases with distance.


Another illustration of this concept is in counting the number of perfect matchings
in a graph. One might ask what is the percentage of matchings in which some edge is
present and ask how correlated this percentage is with the presence or absence of edges
at some distance d. One is interested in whether the correlation drops off with distance.
To explore this concept we consider the Ising model studied in physics.


As mentioned earlier, the Ising or ferromagnetic model is a pairwise Markov random
field. The underlying graph, usually a lattice, assigns a value of $\pm 1$, called spin, to the
variable at each vertex. The probability (Gibbs measure) of a given configuration of spins
is proportional to $\exp\big(\beta \sum_{(i,j) \in E} x_i x_j\big) = \prod_{(i,j) \in E} e^{\beta x_i x_j}$ where $x_i = \pm 1$ is the value associated
with vertex i. Thus

\[ p(x_1, x_2, \ldots, x_n) = \frac{1}{Z} \prod_{(i,j) \in E} \exp(\beta x_i x_j) = \frac{1}{Z}\, e^{\beta \sum_{(i,j) \in E} x_i x_j} \]

where Z is a normalization constant.


The value of the summation is simply the number of edges whose
vertices have the same spin minus the number of edges whose vertices have opposite spin.
The constant $\beta$ is viewed as inverse temperature. High temperature corresponds to a low
value of $\beta$ and low temperature corresponds to a high value of $\beta$. At high temperature,
low $\beta$, the spins of adjacent vertices are uncorrelated whereas at low temperature adjacent
vertices have identical spins. The reason for this is that the probability of a configuration
is proportional to $e^{\beta \sum_{(i,j) \in E} x_i x_j}$. As $\beta$ is increased, for configurations with a large number of
edges whose vertices have identical spins, $e^{\beta \sum_{(i,j) \in E} x_i x_j}$ increases more than for configurations
whose edges have vertices with non-identical spins. When the normalization constant Z
is adjusted for the new value of $\beta$, the highest probability configurations are those where
adjacent vertices have identical spins.
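

To make the effect of $\beta$ concrete, the Gibbs measure can be computed by brute force on a very small graph. The sketch below is illustrative only: the function names and the choice of a 4-cycle are assumptions made for the example, and enumerating all $2^n$ configurations is feasible only for tiny n.

```python
import itertools
import math

def ising_distribution(n, edges, beta):
    """Return {configuration: probability} for the Ising model on n vertices."""
    configs = list(itertools.product([-1, 1], repeat=n))
    # Unnormalised weight exp(beta * sum over edges of x_i * x_j).
    weights = [math.exp(beta * sum(x[i] * x[j] for i, j in edges))
               for x in configs]
    Z = sum(weights)  # normalization constant
    return {x: w / Z for x, w in zip(configs, weights)}

def agreement_probability(n, edges, beta, i, j):
    """Probability that vertices i and j have the same spin."""
    dist = ising_distribution(n, edges, beta)
    return sum(p for x, p in dist.items() if x[i] == x[j])

# Hypothetical example graph: a cycle on four vertices.
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
hot = agreement_probability(4, cycle, 0.0, 0, 1)   # high temperature
cold = agreement_probability(4, cycle, 2.0, 0, 1)  # low temperature
```

At $\beta = 0$ every configuration is equally likely, so adjacent spins agree with probability 1/2. At $\beta = 2$ the two all-equal configurations carry almost all of the probability and adjacent spins agree almost surely.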


Given the above probability distribution, what is the correlation between two variables
$x_i$ and $x_j$? To answer this question, consider the probability that $x_i = +1$ as a function
of the probability that $x_j = +1$. If the probability that $x_i = +1$ is 1/2 independent of the
value of the probability that $x_j = +1$, we say the values are uncorrelated.


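Consider the special case where the graph G is a tree. In this case a phase transition
occurs at $\beta_0 = \frac{1}{2} \ln \frac{d+1}{d-1}$ where d is the degree of the tree. For a sufficiently tall tree and for
$\beta > \beta_0$, the probability that the root has value +1 is bounded away from 1/2 and depends
on whether the majority of leaves have value +1 or -1. For $\beta < \beta_0$ the probability that
the root has value +1 is 1/2 independent of the values at the leaves of the tree.


Consider a height one tree of degree d. If i of the leaves have spin +1 and d - i have
spin -1, then the probability of the root having spin +1 is proportional to

\[ e^{i\beta - (d-i)\beta} = e^{(2i-d)\beta}. \]

If the probability of a leaf being +1 is p, then the probability of i leaves being +1 and
d - i being -1 is

\[ \binom{d}{i} p^i (1-p)^{d-i}. \]

Thus, the probability of the root being +1 is proportional to

\[ A = \sum_{i=0}^{d} \binom{d}{i} p^i (1-p)^{d-i} e^{(2i-d)\beta}
     = e^{-d\beta} \sum_{i=0}^{d} \binom{d}{i} \left(p e^{2\beta}\right)^i (1-p)^{d-i}
     = e^{-d\beta} \left[ p e^{2\beta} + 1 - p \right]^d \]

and the probability of the root being -1 is proportional to

\[ B = \sum_{i=0}^{d} \binom{d}{i} p^i (1-p)^{d-i} e^{-(2i-d)\beta}
     = e^{-d\beta} \sum_{i=0}^{d} \binom{d}{i} p^i \left[(1-p) e^{2\beta}\right]^{d-i}
     = e^{-d\beta} \left[ p + (1-p) e^{2\beta} \right]^d. \]

The probability of the root being +1 is

\[ q = \frac{A}{A+B} = \frac{\left[p e^{2\beta} + 1 - p\right]^d}{\left[p e^{2\beta} + 1 - p\right]^d + \left[p + (1-p) e^{2\beta}\right]^d} = \frac{C}{D} \]

where

\[ C = \left[p e^{2\beta} + 1 - p\right]^d \]

and

\[ D = \left[p e^{2\beta} + 1 - p\right]^d + \left[p + (1-p) e^{2\beta}\right]^d. \]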


At high temperature, low $\beta$, the probability q of the root of the height one tree being
+1 in the limit as $\beta$ goes to zero is

\[ q = \frac{[p + 1 - p]^d}{[p + 1 - p]^d + [p + 1 - p]^d} = \frac{1}{2} \]

independent of p. At low temperature, high $\beta$,

\[ q \approx \frac{p^d e^{2\beta d}}{p^d e^{2\beta d} + (1-p)^d e^{2\beta d}} = \frac{p^d}{p^d + (1-p)^d}. \]

q goes from a low probability of +1 for p below 1/2 to high probability of +1 for p above
1/2.


Now consider a very tall tree. If p is the probability that a root has value +1,
we can iterate the formula for the height one tree and observe that at low temperature
the probability of the root being one converges to some value. At high temperature, the
probability of the root being one is 1/2 independent of p. At the phase transition, the slope
of q at p=1/2 is one. See Figure 9.11.
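

The iteration can be tried numerically. The following sketch evaluates the height one formula for q, iterates it level by level, and uses the closed form $\frac{1}{2} \ln \frac{d+1}{d-1}$ for the transition point. The function names and the particular values of d, $\beta$, and the starting p are assumptions chosen for illustration.

```python
import math

def q(p, beta, d):
    """Probability the root of a height one tree of degree d is +1,
    given that each leaf is +1 independently with probability p."""
    plus = (p * math.exp(2 * beta) + 1 - p) ** d
    minus = (p + (1 - p) * math.exp(2 * beta)) ** d
    return plus / (plus + minus)

def critical_beta(d):
    """Value of beta at which the slope of q at p = 1/2 equals one (d > 1)."""
    return 0.5 * math.log((d + 1) / (d - 1))

def iterate(p, beta, d, levels):
    """Apply the height one map repeatedly, as in a tree with `levels` levels."""
    for _ in range(levels):
        p = q(p, beta, d)
    return p

d = 3
beta0 = critical_beta(d)
# Central difference estimate of the slope of q at p = 1/2 when beta = beta0.
step = 1e-6
slope = (q(0.5 + step, beta0, d) - q(0.5 - step, beta0, d)) / (2 * step)

high_temp = iterate(0.9, 0.2, d, 200)  # beta below beta0: drifts to 1/2
low_temp = iterate(0.9, 1.0, d, 200)   # beta above beta0: stays away from 1/2
```

With these values the estimated slope is one up to numerical error, the high temperature iteration forgets the starting value of p, and the low temperature iteration settles at a value well above 1/2.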


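Now the slope of the probability of the root being 1 with respect to the probability of
a leaf being 1 in this height one tree is

\[ \frac{\partial q}{\partial p} = \frac{D \frac{\partial C}{\partial p} - C \frac{\partial D}{\partial p}}{D^2}. \]

Since the slope of the function q(p) at p=1/2 when the phase transition occurs is one, we
can solve $\frac{\partial q}{\partial p} = 1$ for the value of $\beta$ where the phase transition occurs. First, we show that
$\left.\frac{\partial D}{\partial p}\right|_{p=\frac{1}{2}} = 0$.

\[ D = \left[p e^{2\beta} + 1 - p\right]^d + \left[p + (1-p) e^{2\beta}\right]^d \]

\[ \frac{\partial D}{\partial p} = d \left[p e^{2\beta} + 1 - p\right]^{d-1} \left(e^{2\beta} - 1\right) + d \left[p + (1-p) e^{2\beta}\right]^{d-1} \left(1 - e^{2\beta}\right) \]

\[ \left.\frac{\partial D}{\partial p}\right|_{p=\frac{1}{2}} = \frac{d}{2^{d-1}} \left[e^{2\beta} + 1\right]^{d-1} \left(e^{2\beta} - 1\right) + \frac{d}{2^{d-1}} \left[1 + e^{2\beta}\right]^{d-1} \left(1 - e^{2\beta}\right) = 0 \]

Then

\[ \left.\frac{\partial q}{\partial p}\right|_{p=\frac{1}{2}}
   = \left.\frac{D \frac{\partial C}{\partial p} - C \frac{\partial D}{\partial p}}{D^2}\right|_{p=\frac{1}{2}}
   = \left.\frac{\frac{\partial C}{\partial p}}{D}\right|_{p=\frac{1}{2}}
   = \left.\frac{d \left[p e^{2\beta} + 1 - p\right]^{d-1} \left(e^{2\beta} - 1\right)}{\left[p e^{2\beta} + 1 - p\right]^d + \left[p + (1-p) e^{2\beta}\right]^d}\right|_{p=\frac{1}{2}} \]

\[ = \frac{d \left[\frac{1}{2} e^{2\beta} + \frac{1}{2}\right]^{d-1} \left(e^{2\beta} - 1\right)}{\left[\frac{1}{2} e^{2\beta} + \frac{1}{2}\right]^d + \left[\frac{1}{2} + \frac{1}{2} e^{2\beta}\right]^d}
   = \frac{d \left(e^{2\beta} - 1\right)}{1 + e^{2\beta}}. \]

Setting

\[ \frac{d \left(e^{2\beta} - 1\right)}{1 + e^{2\beta}} = 1 \]

and solving for $\beta$ yields

\[ \beta = \frac{1}{2} \ln \frac{d+1}{d-1}. \]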

To complete the argument, we need to show that q is a monotonic function of p. To see
this, write $q = \frac{1}{1 + \frac{B}{A}}$, where A and B are the quantities proportional to the probability
of the root being +1 and -1 respectively. A is a monotonically increasing function of p and B is monotonically
decreasing. From this it follows that q is monotonically increasing.


Figure 9.11: Shape of q as a function of p for the height one tree and three values of $\beta$
corresponding to low temperature, the phase transition temperature, and high temperature.


In the iteration going from p to q, we do not get the true marginal probabilities at
each level since we ignored the effect of the portion of the tree above. However, when we
get to the root, we do get the true marginal for the root. To get the true marginals for
the interior nodes we need to send messages down from the root.


Note: The joint probability distribution for the tree is of the form $e^{\beta \sum_{(i,j) \in E} x_i x_j} = \prod_{(i,j) \in E} e^{\beta x_i x_j}$.
Suppose $x_1$ has value 1 with probability p. Then define a function $\varphi$, called evidence, such
that

\[ \varphi(x_1) = \begin{cases} p & \text{for } x_1 = 1 \\ 1 - p & \text{for } x_1 = -1 \end{cases}
   \;=\; \left(p - \tfrac{1}{2}\right) x_1 + \tfrac{1}{2} \]

and multiply the joint probability function by $\varphi$. Note, however, that the marginal probability
of $x_1$ is not p. In fact, it may be further from p after multiplying the conditional
probability function by the function $\varphi$.


9.19 Bibliographic Notes 


A formal definition of Topic Models described in this chapter as well as the LDA model
are from Blei, Ng and Jordan [BNJ03]; see also [Ble12]. Non-negative Matrix Factorization
has been used in several contexts, for example [DS03]. Anchor terms were defined
and used in [AGKM16]. Sections 9.7 - 9.9 are simplified versions of results from [BBK14].


Good introductions to hidden Markov models, graphical models, Bayesian networks,
and belief propagation appear in [Gha01, Bis06]. The use of Markov Random Fields
for computer vision originated in the work of Boykov, Veksler, and Zabih [BVZ98], and
further discussion appears in [Bis06]. Factor graphs and message-passing algorithms on
them were formalized as a general approach (incorporating a range of existing algorithms)
in [KFL01]. Message-passing in graphs with loops is discussed in [Wei97, WF01].


The use of belief propagation for maximum weighted matching is from [BSS08]. Survey
propagation and warning propagation for finding satisfying assignments to k-CNF
formulas are described and analyzed in [MPZ02, BMZ05, ACORT11]. For additional
relevant papers and surveys, see [FD07, YFW01, YFW03, FK00].


9.20 Exercises


Exercise 9.1 Find a non-negative factorization of the matrix

\[ A = \begin{pmatrix} 4 & 6 & 5 \\ 1 & 2 & 3 \\ 7 & 10 & 7 \\ 6 & 8 & 4 \\ 6 & 10 & 11 \end{pmatrix} \]

Indicate the steps in your method and show the intermediate results.


Exercise 9.2 Find a non-negative factorization of each of the following matrices. 


10 9 15 14 13 5 5 10 14 17 
2133 1 22 4 4 6 
8 7 13 11 11 112 4 4 
(1) | 7 5 11 10 7 DIETA 
5 5 11 6 11 33 6 & 10 
1 1 3 1 3 5 5 10 16 18 
2 2 2 2 22 4 6 7 
4 4 3 3 1 3 4 3 
13 16 13 10 5 13 14 10 
15 24 21 12 9 21 18 12 ; a i is : : : 
(3) 7 16 15 6 7 15 10 6 (4) 
6 6 12 16 15 15 4 
1 4 4 12 4 2 1 33 3 4 3 3 1 
5 8&8 7 437 6 A 
3 12 12 3 6 12 6 3 
Exercise 9.3 Consider the matrix A that is the product of non-negative matrices B and 
e 12 22 41 35 10 1 
19 20 13 48] [| 1 9 12 4 8 
11, 1416-29 (ES: Cae) 
14 16 14 36 


Which rows of A are approximate positive linear combinations of other rows of A?
Find an approximate non-negative factorization of A.


Exercise 9.4 Consider a set of vectors S in which no vector is a positive linear combi- 
nation of the other vectors in the set. Given a set T containing S along with a number of 
elements from the convex hull of S find the vectors in S. Develop an efficient method to 
find S from T which does not require linear programming. 

Hint: The points of S are vertices of the convex hull of T. The Euclidean length of a
vector is a convex function and so its maximum over a polytope is attained at one of the
vertices. Center the set and find the maximum length vector in T. This will be one
element of S.


Exercise 9.5 Define the non-negative column rank of an m × n matrix A to be the minimum
number of vectors in m-space with the property that every column of A can be
expressed as a non-negative linear combination of these vectors.


1. Show that the non-negative column rank of A is at least the rank of A. 


2. Construct a 3 × n matrix whose non-negative column rank is n. [Hint: Take the
plane x + y + z = 1 in 3-space; draw a circle in the plane and take n points on
the circle.]


3. Show that the non-negative column rank need not be the same as the non-negative 
row rank. 


4. Read/look up a paper of Vavasis showing that the computation of non-negative rank 
is NP-hard. 


Exercise 9.6 What happens to the Topic Modeling problem, when m the number of words 
in a document goes to infinity? Argue that the Idealized Topic Modeling problem of Section 
9.2 is easy to solve when m goes to infinity. 


Exercise 9.7 Suppose $y = (y_1, y_2, \ldots, y_r)$ is jointly distributed according to the Dirichlet
distribution with parameter $\mu = 1/r$. Show that the expected value of $\max_{l=1}^{r} y_l$ is greater
than 0.1. [Hint: Lemma 9.6]


Exercise 9.8 Suppose there are s documents in a collection which are all nearly pure
for a particular topic. I.e., in each of these documents, that topic has weight at least
$1 - \delta$. Suppose someone finds and hands to you these documents. Then their average is
an approximation to the topic vector. In terms of s, m and $\delta$ compute an upper bound on
the error of approximation.


It may help to do the following exercise before reading Section 9.7, to build intuition
for that section.


Exercise 9.9 Two topics and two words. Toy case of the “Dominant Admixture Model”.
Suppose in a topic model, there are just two words and two topics; word 1 is a “key word”
of topic 1 and word 2 is a key word of topic 2 in the sense:

\[ b_{11} \geq 2 b_{12} \; ; \quad b_{21} \leq \tfrac{1}{2} b_{22}. \]

Suppose each document has one of the two topics as a dominant topic in the sense:

\[ \max(c_{1j}, c_{2j}) \geq 0.75. \]

Also suppose

\[ \left| (A - BC)_{ij} \right| \leq 0.1 \quad \forall\, i, j. \]

Show that there are two real numbers $\mu_1$ and $\mu_2$ such that each document j with dominant
topic 1 has $a_{1j} > \mu_1$ and $a_{2j} < \mu_2$ and each document j' with dominant topic 2 has
$a_{2j'} > \mu_2$ and $a_{1j'} < \mu_1$.


Exercise 9.10 What is the probability of heads occurring after a sufficiently long sequence
of transitions in the Viterbi algorithm example of the most likely sequence of states?


Exercise 9.11 Find optimum parameters for a three state HMM and given output sequence.
Note the HMM must have a strong signature in the output sequence or we probably
will not be able to find it. The following example may not be good for that
reason.


[Table with columns 1, 2, 3, A, B; entries illegible in the source.]


Exercise 9.12 In the Ising model for a tree of degree one, a chain of vertices, is there a
phase transition where the correlation between the value at the root and the value at the
leaves becomes independent? Work out mathematically what happens.


Exercise 9.13 For a Boolean function in CNF the marginal probability gives the number
of satisfying assignments with $x_i$.


How does one obtain the number of satisfying assignments for a 2-CNF formula? Not
completely related to first sentence.


10 Other Topics


10.1 Ranking and Social Choice 


Combining feedback from multiple users to rank a collection of items is an important 
task. We rank movies, restaurants, web pages, and many other items. Ranking has be- 
come a multi-billion dollar industry as organizations try to raise the position of their web 
pages in the results returned by search engines to relevant queries. Developing a method 
of ranking that cannot be easily gamed by those involved is an important task. 


A ranking of a collection of items is defined as a complete ordering. For every pair of 
items a and b, either a is preferred to b or b is preferred to a. Furthermore, a ranking is 
transitive in that a > b and b > c implies a > c. 


One problem of interest in ranking is that of combining many individual rankings into 
one global ranking. However, merging ranked lists in a meaningful way is non-trivial as 
the following example illustrates. 


Example: Suppose there are three individuals who rank items a, b, and c as illustrated 
in the following table. 














individual | first item | second item | third item
1 | a | b | c
2 | b | c | a
3 | c | a | b




















Suppose our algorithm tried to rank the items by first comparing a to b and then
comparing b to c. In comparing a to b, two of the three individuals prefer a to b and thus
we conclude a is preferable to b. In comparing b to c, again two of the three individuals
prefer b to c and we conclude that b is preferable to c. Now by transitivity one would
expect that the individuals would prefer a to c, but such is not the case: only one of the
individuals prefers a to c and thus c is preferable to a. We come to the illogical conclusion
that a is preferable to b, b is preferable to c, and c is preferable to a. ■
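

The cycle in this example is easy to check mechanically. The short sketch below encodes the three rankings from the table as strings (an encoding chosen here for convenience) and tests each pairwise majority; the helper name `majority_prefers` is hypothetical.

```python
def majority_prefers(rankings, x, y):
    """True if a strict majority of the rankings place x ahead of y."""
    votes = sum(r.index(x) < r.index(y) for r in rankings)
    return 2 * votes > len(rankings)

# Rankings of individuals 1, 2, and 3, most preferred item first.
rankings = ["abc", "bca", "cab"]

a_over_b = majority_prefers(rankings, "a", "b")
b_over_c = majority_prefers(rankings, "b", "c")
c_over_a = majority_prefers(rankings, "c", "a")
```

All three comparisons come out true, so pairwise majority voting on these rankings is not transitive.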


Suppose there are a number of individuals or voters and a set of candidates to be 
ranked. Each voter produces a ranked list of the candidates. From the set of ranked lists 
can one construct a reasonable single ranking of the candidates? Assume the method of 
producing a global ranking is required to satisfy the following three axioms. 


Non-dictatorship - The algorithm cannot always simply select one individual's ranking
to use as the global ranking.


Unanimity - If every individual prefers a to b, then the global ranking must prefer a to b.


Independence of irrelevant alternatives - If individuals modify their rankings but
keep the order of a and b unchanged, then the global order of a and b should
not change.


Arrow showed that it is not possible to satisfy all three of the above axioms. We begin 
with a technical lemma. 


Lemma 10.1 For a set of rankings in which each individual ranks an item b either first 
or last (some individuals may rank b first and others may rank b last), a global ranking 
satisfying the above axioms must put b first or last. 


Proof: Let a, b, and c be distinct items. Suppose to the contrary that b is not first or
last in the global ranking. Then there exist a and c where the global ranking puts a > b
and b > c. By transitivity, the global ranking puts a > c. Note that all individuals can
move c above a without affecting the order of b and a or the order of b and c since b
was first or last on each list. Thus, by independence of irrelevant alternatives, the global
ranking would continue to rank a > b and b > c even if all individuals moved c above a
since that would not change the individuals' relative order of a and b or the individuals'
relative order of b and c. But then by unanimity, the global ranking would need to put
c > a, a contradiction. We conclude that the global ranking puts b first or last. ■


Theorem 10.2 (Arrow) Any deterministic algorithm for creating a global ranking from 
individual rankings of three or more elements in which the global ranking satisfies una- 
nimity and independence of irrelevant alternatives is a dictatorship. 


Proof: Let a, b, and c be distinct items. Consider a set of rankings in which every in- 
dividual ranks b last. By unanimity, the global ranking must also rank b last. Let the 
individuals, one by one, move b from bottom to top leaving the other rankings in place. 
By unanimity, the global ranking must eventually move b from the bottom all the way to 
the top. When b first moves, it must move all the way to the top by Lemma 10.1. 


Let v be the first individual whose change causes the global ranking of b to change. 
We argue that v is a dictator. First, we argue that v is a dictator for any pair ac not 
involving b. We will refer to the three rankings of v in Figure 10.1. The first ranking 
of v is the ranking prior to v moving b from the bottom to the top and the second is 
the ranking just after v has moved b to the top. Choose any pair ac where a is above 
c in v’s ranking. The third ranking of v is obtained by moving a above b in the second 
ranking so that a > b > c in v’s ranking. By independence of irrelevant alternatives, 
the global ranking after v has switched to the third ranking puts a > b since all indi- 
vidual ab votes are the same as in the first ranking, where the global ranking placed 
a > b. Similarly b > c in the global ranking since all individual bc votes are the same 
as in the second ranking, in which b was at the top of the global ranking. By transitivity
the global ranking must put a > c and thus the global ranking of a and c agrees with v.


Figure 10.1: The three rankings that are used in the proof of Theorem 10.2.


Now all individuals except v can modify their rankings arbitrarily while leaving b in its 
extreme position and by independence of irrelevant alternatives, this does not affect the 
global ranking of a > b or of b > c. Thus, by transitivity this does not affect the global 
ranking of a and c. Next, all individuals except v can move b to any position without 
affecting the global ranking of a and c. 


At this point we have argued that independent of other individuals’ rankings, the 
global ranking of a and c will agree with v’s ranking. Now v can change its ranking 
arbitrarily, provided it maintains the order of a and c, and by independence of irrelevant 
alternatives the global ranking of a and c will not change and hence will agree with v. 
Thus, we conclude that for all a and c, the global ranking agrees with v independent of 
the other rankings except for the placement of b. But other rankings can move b without 
changing the global order of other elements. Thus, v is a dictator for the ranking of any 
pair of elements not involving b. 


Note that v changed the relative order of a and b in the global ranking when it moved 
b from the bottom to the top in the previous argument. We will use this in a moment. 


To show that individual v is also a dictator over every pair ab, repeat the construction
showing that v is a dictator for every pair ac not involving b only this time place c at
the bottom. There must be an individual $v_c$ who is a dictator for any pair such as ab not
involving c. Since both v and $v_c$ can affect the global ranking of a and b independent of
each other, it must be that $v_c$ is actually v. Thus, the global ranking agrees with v no
matter how the other voters modify their rankings. ■


10.1.1 Randomization 


An interesting randomized algorithm that satisfies unanimity and independence of irrelevant
alternatives is to pick a random individual and use that individual’s ranking as
the output. This is called the “random dictator” rule because it is a randomization over
dictatorships. An analogous scheme in the context of voting would be to select a winner
with probability proportional to the number of votes for that candidate, because this is
the same as selecting a random voter and telling that voter to determine the winner. Note
that this method has the appealing property that as a voter, there is never any reason
to strategize, e.g., voting for candidate a rather than your preferred candidate b because
you think b is unlikely to win and you don’t want to throw away your vote. With this
method, you should always vote for your preferred candidate.


10.1.2 Examples 


Borda Count: Suppose we view each individual’s ranking as giving each item a score: 
putting an item in last place gives it one point, putting it in second-to-last place gives it 
two points, third-to-last place is three points, and so on. In this case, one simple way to 
combine rankings is to sum up the total number of points received by each item and then 
sort by total points. This is called the extended Borda Count method. 


Let’s examine which axioms are satisfied by this approach. It is easy to see that it 
is a non-dictatorship. It also satisfies unanimity: if every individual prefers a to b, then 
every individual gives more points to a than to b, and so a will receive a higher total than 
b. By Arrow’s theorem, the approach must fail independence of irrelevant alternatives, 
and indeed this is the case. Here is a simple example with three voters and four items 
{a, b,c,d} where the independence of irrelevant alternatives axiom fails: 














individual | ranking
1 | abcd
2 | abcd
3 | bacd














In this example, a receives 11 points and is ranked first, b receives 10 points and is ranked 
second, c receives 6 points and is ranked third, and d receives 3 points and is ranked 
fourth. However, if individual 3 changes his ranking to bcda, then this reduces the total 
number of points received by a to 9, and so b is now ranked first overall. Thus, even 
though individual 3’s relative order of b and a did not change, and indeed no individual’s 
relative order of b and a changed, the global order of b and a did change. 
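

The point totals in this example can be reproduced with a few lines of Python. This is a small sketch of the scoring rule described above; the function names and the string encoding of rankings are assumptions made for the illustration.

```python
def borda_scores(rankings):
    """Total Borda points per item: last place earns 1 point,
    second-to-last earns 2, and so on up to first place."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] = scores.get(item, 0) + (n - position)
    return scores

def borda_ranking(rankings):
    """Items sorted by decreasing total points."""
    scores = borda_scores(rankings)
    return sorted(scores, key=scores.get, reverse=True)

before = ["abcd", "abcd", "bacd"]
after = ["abcd", "abcd", "bcda"]   # individual 3 moves a to last place
```

Here `borda_scores(before)` gives a 11 points and b 10, while `borda_scores(after)` gives a 9 points and b 10, so the global order of a and b flips even though every individual's relative order of a and b is unchanged.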


Hare voting: An interesting system for voting is to have everyone vote for their fa- 
vorite candidate. If some candidate receives a majority of the votes, he or she is declared 
the winner. If no candidate receives a majority of votes, the candidate with the fewest 
votes is dropped from the slate and the process is repeated. 


The Hare system implements this method by asking each voter to rank all the candidates.
Then one counts how many voters ranked each candidate as number one. If no
candidate receives a majority, the candidate with the fewest number one votes is dropped
from each voter's ranking. If the dropped candidate was number one on some voter's list,
then the number two candidate becomes that voter’s number one choice. The process of
counting the number one rankings is then repeated.


We can convert the Hare voting system into a ranking method in the following way. 
Whichever candidate is dropped first is put in last place, whichever is dropped second is 
put in second-to-last place, and so on, until the system selects a winner, which is put in 
first place. The candidates remaining, if any, are placed between the first-place candidate 
and the candidates who were dropped, in an order determined by running this procedure 
recursively on just those remaining candidates. 


As with Borda Count, the Hare system also fails to satisfy independence of irrelevant 
alternatives. Consider the following situation in which there are 21 voters that fall into 
four categories. Voters within a category rank individuals in the same order. 

















Category | Number of voters in category | Preference order
1 | 7 | abcd
2 | 6 | bacd
3 | 5 | cbad
4 | 3 | dcba

















The Hare system would first eliminate d since d gets only three rank one votes. Then 
it would eliminate b since b gets only six rank one votes whereas a gets seven and c gets 
eight. At this point a is declared the winner since a has thirteen votes to c’s eight votes. 
So, the final ranking is acbd. 


Now assume that Category 4 voters who prefer b to a move b up to first place. This 
keeps their order of a and b unchanged, but it reverses the global order of a and b. In 
particular, d is first eliminated since it gets no rank one votes. Then c with five votes is 
eliminated. Finally, b is declared the winner with 14 votes, so the final ranking is bacd. 


Interestingly, Category 4 voters who dislike a and have ranked a last could prevent a 
from winning by moving a up to first. Ironically this results in eliminating d, then c, with 
five votes and declaring b the winner with 11 votes. Note that by moving a up, category 
4 voters were able to deny a the election and get b to win, whom they prefer over a. 
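

The three scenarios above can be replayed with a short simulation of the elimination rounds. The sketch below is one possible reading of the Hare procedure; the function name, the `(count, ranking)` ballot encoding, and the alphabetical tie-breaking are assumptions made here (no ties arise in these examples).

```python
def hare(ballots):
    """Run Hare elimination.

    ballots: list of (number_of_voters, ranking string) pairs.
    Returns (winner, list of eliminated candidates in order of elimination).
    """
    remaining = sorted(set(ballots[0][1]))
    total = sum(count for count, _ in ballots)
    eliminated = []
    while True:
        tally = {c: 0 for c in remaining}
        for count, ranking in ballots:
            # Each ballot counts for its highest ranked remaining candidate.
            top = next(c for c in ranking if c in tally)
            tally[top] += count
        leader = max(remaining, key=lambda c: tally[c])
        if 2 * tally[leader] > total:
            return leader, eliminated
        loser = min(remaining, key=lambda c: tally[c])
        remaining.remove(loser)
        eliminated.append(loser)

original = [(7, "abcd"), (6, "bacd"), (5, "cbad"), (3, "dcba")]
b_moved_up = [(7, "abcd"), (6, "bacd"), (5, "cbad"), (3, "bdca")]
a_moved_up = [(7, "abcd"), (6, "bacd"), (5, "cbad"), (3, "adcb")]
```

On `original` the procedure eliminates d, then b, and a wins. On both modified profiles it eliminates d, then c, and b wins, matching the counts worked out in the text.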


10.2 Compressed Sensing and Sparse Vectors 


Define a signal to be a vector x of length d, and define a measurement of x to be a dot-product
of x with some known vector $a_i$. If we wish to uniquely reconstruct x without
any assumptions, then d linearly-independent measurements are necessary and sufficient.
Given b = Ax where A is known and invertible, we can reconstruct x as $x = A^{-1}b$. In
the case where there are fewer than d independent measurements and the rank of A is less
than d, there will be multiple solutions. However, if we knew that x is sparse with s < d
non-zero elements, then we might be able to reconstruct x with far fewer measurements
using a matrix A with n < d rows. See Figure 10.2. In particular, it turns out that
a matrix A whose columns are nearly orthogonal, such as a matrix of random Gaussian
entries, will be especially well-suited to this task. This is the idea of compressed sensing.
Note that we cannot make the columns of A be completely orthogonal since A has more
columns than rows.


Figure 10.2: Ax = b has a vector space of solutions but possibly only one sparse
solution. If the columns of A are unit length vectors that are pairwise nearly orthogonal,
then the system has a unique sparse solution.


Compressed sensing has found many applications, including reducing the number of 
sensors needed in photography, using the fact that images tend to be sparse in the wavelet 
domain, and in speeding up magnetic resonance imaging in medicine. 


10.2.1 Unique Reconstruction of a Sparse Vector 


A vector is said to be s-sparse if it has at most s non-zero elements. Let x be a
d-dimensional, s-sparse vector with s < d. Consider solving Ax = b for x where A is an
n × d matrix with n < d. The set of solutions to Ax = b is an affine subspace. However, if
we restrict ourselves to sparse solutions, under certain conditions on A there is a unique
s-sparse solution. Suppose that there were two s-sparse solutions, $x_1$ and $x_2$. Then
$x_1 - x_2$ would be a 2s-sparse solution to the homogeneous system Ax = 0. A non-zero
2s-sparse solution to the homogeneous equation Ax = 0 requires that some 2s columns of A be
linearly dependent. Unless A has 2s linearly dependent columns there can be only one
s-sparse solution.


The solution to the reconstruction problem is simple. If the matrix A has at least 2s
rows and the entries of A were selected at random from a standard Gaussian, then with
probability one, no set of 2s columns will be linearly dependent. We can see this by
noting that if we first fix a subset of 2s columns and then choose the entries at random,
the probability that this specific subset is linearly dependent is the same as the
probability that 2s random Gaussian vectors in a 2s-dimensional space are linearly
dependent, which is zero. So, taking the union bound over all $\binom{d}{2s}$ subsets, the
probability that any one of them is linearly dependent is zero.


Figure 10.3: Illustration of minimum 1-norm and 2-norm solutions.


The above argument shows that if we choose n = 2s and pick entries of A randomly 
from a Gaussian, with probability one there will be a unique s-sparse solution. Thus, 
to solve for x we could try all $\binom{d}{s}$ possible locations for the non-zero elements in x and
aim to solve Ax = b over just those s columns of A: any one of these that gives a
solution will be the correct answer. However, this takes time $\Omega(d^s)$ which is exponential
in s. We turn next to the topic of efficient algorithms, describing a polynomial-time 
optimization procedure that will find the desired solution when n is sufficiently large and 
A is constructed appropriately. 
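
The exhaustive search just described can be sketched in a few lines. This is only an illustration under stated assumptions: to keep the arithmetic exact, the random Gaussian matrix is replaced by a small integer Vandermonde matrix with n = 2s rows, which also has the property that no 2s columns are linearly dependent. The helper names `solve_on_support` and `brute_force_sparse` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def solve_on_support(A, b, support):
    """Solve A[:, support] y = b exactly; return y, or None if there is no solution."""
    n, k = len(A), len(support)
    M = [[Fraction(A[i][j]) for j in support] + [Fraction(b[i])] for i in range(n)]
    row = 0
    for col in range(k):
        pivot = next((r for r in range(row, n) if M[r][col] != 0), None)
        if pivot is None:
            return None  # dependent columns on this support; skip it
        M[row], M[pivot] = M[pivot], M[row]
        lead = M[row][col]
        M[row] = [v / lead for v in M[row]]
        for r in range(n):
            if r != row and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [p - factor * q for p, q in zip(M[r], M[row])]
        row += 1
    # Remaining rows have zero coefficients, so their right-hand sides must vanish.
    if any(M[r][k] != 0 for r in range(row, n)):
        return None
    return [M[r][k] for r in range(k)]

def brute_force_sparse(A, b, s):
    """Try every set of s columns; exponential in s, as noted in the text."""
    d = len(A[0])
    for support in combinations(range(d), s):
        y = solve_on_support(A, b, support)
        if y is not None:
            x = [Fraction(0)] * d
            for j, value in zip(support, y):
                x[j] = value
            return x
    return None

s, d = 2, 8
A = [[(j + 1) ** i for j in range(d)] for i in range(2 * s)]  # Vandermonde (assumption)
x_true = [0, 0, 3, 0, 0, -2, 0, 0]
b = [sum(A[i][j] * x_true[j] for j in range(d)) for i in range(2 * s)]
x_rec = brute_force_sparse(A, b, s)
print([int(v) for v in x_rec])  # [0, 0, 3, 0, 0, -2, 0, 0]
```

The loop visits up to $\binom{d}{s}$ supports, which is what makes this approach impractical beyond very small s.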


10.2.2 Efficiently Finding the Unique Sparse Solution 


To find a sparse solution to Ax = b, one would like to minimize the zero norm $\|x\|_0$
over {x | Ax = b}, i.e., minimize the number of non-zero entries. This is a computationally
hard problem. There are techniques to minimize a convex function over a convex set, but
$\|x\|_0$ is not a convex function, and with no further assumptions, it is NP-hard. With this
in mind, we use the one-norm as a proxy for the zero-norm and minimize the one-norm
$\|x\|_1 = \sum_i |x_i|$ over {x | Ax = b}. Although this problem appears to be non-linear, it can
be solved by linear programming by writing x = u − v, u ≥ 0, and v ≥ 0, and minimizing
the linear function $\sum_i u_i + \sum_i v_i$ subject to Au − Av = b, u ≥ 0, and v ≥ 0.
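
As a toy illustration of why the one-norm favors sparse solutions, consider a single measurement a · x = b. The sketch below does not call a general LP solver. Instead it relies on the standard fact that a linear program with one equality constraint attains its optimum at a basic feasible solution with at most one non-zero variable, so it is enough to try each coordinate of u and v in turn. The function names are illustrative.

```python
def min_one_norm_single_equation(a, b):
    """Minimize sum(u) + sum(v) subject to a.(u - v) = b, u >= 0, v >= 0.

    With one equality constraint, each basic feasible solution has a single
    non-zero variable, so we enumerate the coordinates.
    """
    best = None
    for j, a_j in enumerate(a):
        if a_j == 0:
            continue
        value = b / a_j      # x_j = u_j - v_j; the sign decides whether u_j or v_j is used
        cost = abs(value)    # objective value u_j + v_j at this vertex
        if best is None or cost < best[0]:
            best = (cost, j, value)
    x = [0.0] * len(a)
    if best is not None:
        x[best[1]] = best[2]
    return x

def min_two_norm_single_equation(a, b):
    """Minimum 2-norm solution of a.x = b, namely x = b * a / |a|^2."""
    scale = b / sum(a_j * a_j for a_j in a)
    return [scale * a_j for a_j in a]

a, b = [1.0, 2.0, 4.0], 8.0
x1 = min_one_norm_single_equation(a, b)   # [0.0, 0.0, 2.0]
x2 = min_two_norm_single_equation(a, b)   # every coordinate is non-zero
print(x1, x2)
```

The minimum one-norm solution puts all of its weight on the coordinate with the largest $|a_j|$, while the minimum two-norm solution spreads weight over every coordinate.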





42 This can be seen by selecting the vectors one at a time. The probability that the $i$th new vector lies
fully in the lower dimensional subspace spanned by the previous $i - 1$ vectors is zero, and so by the union
bound the overall probability is zero.

We now show that if the columns of the n by d matrix A are unit length, almost orthogonal
vectors with pairwise dot products in the range $(-\frac{1}{2s}, \frac{1}{2s})$, then minimizing $\|x\|_1$ over
{x | Ax = b} recovers the unique s-sparse solution to Ax = b. The $ij$th element of the
matrix $A^TA$ is the cosine of the angle between the $i$th and $j$th columns of A. If the columns
of A are unit length and almost orthogonal, $A^TA$ will have ones on its diagonal and all
off-diagonal elements will be small. By Theorem 2.8, if A has $n = s^2 \log d$ rows and each
column is a random unit-length n-dimensional vector, with high probability all pairwise
dot-products will have magnitude less than $\frac{1}{2s}$ as desired.$^{43}$ Here, we use $s^2 \log d$, a larger
value of n compared to the existence argument in Section 10.2.1, but now the algorithm
is computationally efficient.


Let $x_0$ denote the unique s-sparse solution to Ax = b and let $x_1$ be a solution of
smallest possible one-norm. Let $z = x_1 - x_0$. We now prove that z = 0, implying that
$x_1 = x_0$. First, $Az = Ax_1 - Ax_0 = b - b = 0$. This implies that $A^TAz = 0$. Since each
column of A is unit length, the matrix $A^TA$ has ones on its diagonal. Since every pair of
distinct columns of A has dot-product in the range $(-\frac{1}{2s}, \frac{1}{2s})$, each off-diagonal entry in
$A^TA$ is in the range $(-\frac{1}{2s}, \frac{1}{2s})$. These two facts imply that unless z = 0, every entry in z
must have absolute value less than $\frac{1}{2s}\|z\|_1$. If the $j$th entry in z had absolute value greater
than or equal to $\frac{1}{2s}\|z\|_1$, it would not be possible for the $j$th entry of $A^TAz$ to equal 0
unless $\|z\|_1 = 0$.
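
The step from the two facts to the bound on each entry can be written out explicitly. In the display below, $a_j$ denotes the $j$th column of A (notation taken from Theorem 10.3), and the only inputs are $A^TAz = 0$, the unit diagonal of $A^TA$, and the bound on its off-diagonal entries.

```latex
0 = (A^TAz)_j = z_j + \sum_{k \ne j} (a_j \cdot a_k)\, z_k
\quad\Longrightarrow\quad
|z_j| \;\le\; \sum_{k \ne j} |a_j \cdot a_k|\,|z_k|
      \;\le\; \frac{1}{2s} \sum_{k \ne j} |z_k|
      \;\le\; \frac{1}{2s}\,\|z\|_1 .
```

The middle inequality is strict whenever some $z_k$ with $k \ne j$ is non-zero, and if all such $z_k$ vanish then the first inequality forces $z_j = 0$ as well, so for $z \ne 0$ the bound $|z_j| < \frac{1}{2s}\|z\|_1$ holds strictly.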


Finally let S denote the support of $x_0$, where |S| ≤ s. We now argue that z must
have at least half of its $\ell_1$ norm inside of S, i.e., $\sum_{j \in S} |z_j| \ge \frac{1}{2}\|z\|_1$. This will complete
the argument because it implies that the average value of $|z_j|$ for j ∈ S is at least $\frac{1}{2s}\|z\|_1$,
which as shown above is only possible if $\|z\|_1 = 0$. Let $t_{in}$ denote the sum of the absolute
values of the entries of $x_1$ in the set S, and let $t_{out}$ denote the sum of the absolute values
of the entries of $x_1$ outside of S. So, $t_{in} + t_{out} = \|x_1\|_1$. Let $t_0$ be the one-norm of $x_0$.
Since $x_1$ is the minimum one-norm solution, $t_0 \ge t_{in} + t_{out}$, or equivalently $t_0 - t_{in} \ge t_{out}$.
But $\sum_{j \in S} |z_j| \ge t_0 - t_{in}$ and $\sum_{j \notin S} |z_j| = t_{out}$. This implies that $\sum_{j \in S} |z_j| \ge \sum_{j \notin S} |z_j|$, or
equivalently, $\sum_{j \in S} |z_j| \ge \frac{1}{2}\|z\|_1$, which as noted above implies that $\|z\|_1 = 0$, as desired. ■


To summarize, we have shown the following theorem and corollary. 


Theorem 10.3 If matrix A has unit-length columns $a_1, \ldots, a_d$ and the property that
$|a_i \cdot a_j| < \frac{1}{2s}$ for all i ≠ j, then if the equation Ax = b has a solution with at most s
non-zero coordinates, this solution is the unique minimum 1-norm solution to Ax = b.

Corollary 10.4 For some absolute constant c, if A has n rows for $n \ge cs^2 \log d$ and each
column of A is chosen to be a random unit-length n-dimensional vector, then with high
probability A satisfies the conditions of Theorem 10.3 and therefore if the equation Ax = b
has a solution with at most s non-zero coordinates, this solution is the unique minimum
1-norm solution to Ax = b.
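
The condition of Theorem 10.3 is easy to test numerically for a given matrix. The sketch below draws random unit-length columns, as in Corollary 10.4, and reports the largest pairwise dot product. The sizes n, d, s and the seed are arbitrary choices for illustration, and the function names are hypothetical.

```python
import math
import random

def random_unit_vector(n, rng):
    """A random direction in n dimensions: normalize a vector of Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(t * t for t in v))
    return [t / norm for t in v]

def coherence(columns):
    """Largest |a_i . a_j| over distinct pairs of columns."""
    worst = 0.0
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            dot = sum(p * q for p, q in zip(columns[i], columns[j]))
            worst = max(worst, abs(dot))
    return worst

def satisfies_theorem_10_3(columns, s):
    """True if every pairwise dot product has magnitude below 1/(2s)."""
    return coherence(columns) < 1.0 / (2 * s)

rng = random.Random(0)
n, d, s = 400, 50, 2          # illustrative sizes, not taken from the text
columns = [random_unit_vector(n, rng) for _ in range(d)]
print(coherence(columns), satisfies_theorem_10_3(columns, s))
```

Increasing n for fixed d makes the columns closer to orthogonal, which is the trade-off the corollary quantifies.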





43 Note that the roles of “n” and “d” are reversed here compared to Theorem 2.8.

Figure 10.4: The system of linear equations used to find the internal code for some
observable phenomenon.


The condition of Theorem 10.3 is often called incoherence of the matrix A. Other more 
involved arguments show that it is possible to recover the sparse solution using one-norm 
minimization for a number of rows n as small as O(s log(ds)). 


10.3 Applications 
10.3.1 Biological 


There are many areas where linear systems arise in which a sparse solution is unique. 
One is in plant breeding. Consider a breeder who has a number of apple trees and for 
each tree observes the strength of some desirable feature. He wishes to determine which 
genes are responsible for the feature so he can crossbreed to obtain a tree that better 
expresses the desirable feature. This gives rise to a set of equations Ax = b where each 
row of the matrix A corresponds to a tree and each column to a position on the genome.
See Figure 10.4. The vector b corresponds to the strength of the desired feature in each
tree. The solution x tells us the position on the genome corresponding to the genes that
account for the feature. It would be surprising if there were two small independent sets
of genes that accounted for the desired feature. Thus, the matrix should have a property
that allows only one sparse solution.


10.3.2 Low Rank Matrices


Suppose L is a low rank matrix that has been corrupted by noise. That is, A = L + R.
If R is Gaussian, then principal component analysis will recover L from A. However,
if L has been corrupted by several missing entries or several entries have a large noise
added to them and they become outliers, then principal component analysis may be far
off. However, if L is low rank and R is sparse, then L can be recovered effectively from
L + R. To do this, find the L and R that minimize $\|L\|_* + \lambda\|R\|_1$.$^{44}$ Here the nuclear norm
$\|L\|_*$ is the 1-norm of the vector of singular values of L and $\|R\|_1$ is the entrywise 1-norm
$\sum_{ij} |r_{ij}|$. A small value of $\|L\|_*$ indicates a sparse vector of singular values and hence a
low rank matrix. Minimizing $\|L\|_* + \lambda\|R\|_1$ subject to L + R = A is a complex problem
and there has been much work on it. The reader is referred to [Add references].
Notice that we do not need to know the rank of L or the elements that were corrupted.
All we need is that the low rank matrix L is not sparse and that the sparse matrix R is
not low rank. We leave the proof as an exercise.


If A is a small matrix, one method to find L and R by minimizing $\|L\|_* + \|R\|_1$ is to
write L through a singular value decomposition $L = U\Sigma V^T$ and minimize $\|\Sigma\|_1 + \|R\|_1$ subject to
A = L + R and $U\Sigma V^T$ being a singular value decomposition. This can be done
using Lagrange multipliers (??). Write $R = R^+ - R^-$ where $R^+ \ge 0$ and $R^- \ge 0$. Let

\[ f(\sigma_i, r_{ij}) = \sum_{i=1}^{n} \sigma_i + \sum_{ij} r^+_{ij} + \sum_{ij} r^-_{ij}. \]

Write the Lagrange formula

\[ l = f(\sigma_i, r_{ij}) + \sum_i \lambda_i g_i \]

where the $g_i$ are the required constraints

1. $r^+_{ij} \ge 0$

2. $r^-_{ij} \ge 0$

3. $\sigma_i \ge 0$

4. $a_{ij} = l_{ij} + r_{ij}$

5. $u_i^T u_j = 1$ if i = j and 0 if i ≠ j

6. $v_i^T v_j = 1$ if i = j and 0 if i ≠ j

7. $L = \sum_i u_i \sigma_i v_i^T$





44 To minimize the absolute value of x, write x = u − v and, using linear programming, minimize u + v
subject to u ≥ 0 and v ≥ 0.

Conditions (5) and (6) ensure that $U\Sigma V^T$ is the SVD of some matrix. The solution is
obtained when $\nabla(l) = 0$, which can be found by gradient descent using $\nabla^2(l)$.


An example where low rank matrices that have been corrupted might occur is aerial 
photographs of an intersection. Given a long sequence of such photographs, they will be 
the same except for cars and people. If each photo is converted to a vector and the vector 
used to make a column of a matrix, then the matrix will be low rank corrupted by the 
traffic. Finding the original low rank matrix will separate the cars and people from the
background.


10.4 An Uncertainty Principle 


Given a function x(t), one can represent the function by the composition of sinusoidal
functions. Basically one is representing the time function by its frequency components.
The transformation from the time representation of a function to its frequency
representation is accomplished by a Fourier transform. The Fourier transform of a function x(t)
is given by

\[ f(\omega) = \int x(t)\, e^{-2\pi i \omega t}\, dt \]

Converting the frequency representation back to the time representation is done by the
inverse Fourier transformation

\[ x(t) = \int f(\omega)\, e^{2\pi i \omega t}\, d\omega \]

In the discrete case, $x = [x_0, x_1, \ldots, x_{n-1}]$ and $f = [f_0, f_1, \ldots, f_{n-1}]$. The Fourier
transform is f = Ax with $a_{ij} = \frac{1}{\sqrt{n}}\omega^{ij}$ where $\omega$ is the principal $n$th root of unity. The inverse
transform is x = Bf where $B = A^{-1}$ has the simple form $b_{ij} = \frac{1}{\sqrt{n}}\omega^{-ij}$.

There are many other transforms such as the Laplace, wavelets, chirplets, etc. In fact, 
any non-singular n x n matrix can be used as a transform. 
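
The discrete transform and its inverse can be checked on a small vector. The sketch below builds the two matrices with entries $\omega^{\pm jk}/\sqrt{n}$. It assumes $\omega = e^{-2\pi i/n}$, chosen to match the sign of the exponent used in Section 10.4.2; the opposite choice of primitive root gives an equally valid transform pair. The helper names are illustrative.

```python
import cmath
import math

def dft_matrix(n, inverse=False):
    """Return the n x n matrix with entries omega^(jk) / sqrt(n).

    omega = exp(-2*pi*i/n) for the forward transform (an assumed convention),
    and its complex conjugate for the inverse.
    """
    sign = 1 if inverse else -1
    omega = cmath.exp(sign * 2j * cmath.pi / n)
    return [[omega ** (j * k) / math.sqrt(n) for k in range(n)] for j in range(n)]

def apply(matrix, vector):
    """Matrix-vector product for lists of (complex) numbers."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

x = [1.0, 2.0, 0.0, -1.0]
f = apply(dft_matrix(4), x)                       # frequency representation
x_back = apply(dft_matrix(4, inverse=True), f)    # recovers x up to rounding
print([round(t.real, 6) for t in x_back])
```

Because both matrices carry the factor $1/\sqrt{n}$, they are inverses of each other with no extra scaling.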


10.4.1 Sparse Vector in Some Coordinate Basis 


Consider Ax = b where A is a square n x n matrix. The vectors x and b can be con- 
sidered as two representations of the same quantity. For example, x might be a discrete 
time sequence, b the frequency spectrum of x, and the matrix A the Fourier transform. 
The quantity x can be represented in the time domain by x and in the frequency domain 
by its Fourier transform b. 


Any orthonormal matrix can be thought of as a transformation and there are many 
important transformations other than the Fourier transformation. Consider a transfor- 
mation A and a signal x in some standard representation. Then y = Ax transforms 
the signal x to another representation y. If A spreads any sparse signal x out so that
the information contained in each coordinate in the standard basis is spread out to all
coordinates in the second basis, then the two representations are said to be incoherent. 
A signal and its Fourier transform are one example of incoherent vectors. This suggests 
that if x is sparse, only a few randomly selected coordinates of its Fourier transform are 
needed to reconstruct x. Below, we show that a signal cannot be too sparse in both its 
time domain and its frequency domain. 


10.4.2 A Representation Cannot be Sparse in Both Time and Frequency 
Domains 


There is an uncertainty principle that states that a time signal cannot be sparse in 
both the time domain and the frequency domain. If the signal is of length n, then the 
product of the number of non-zero coordinates in the time domain and the number of 
non-zero coordinates in the frequency domain must be at least n. This is the mathemati- 
cal version of Heisenberg's uncertainty principle. Before proving the uncertainty principle 
we first prove a technical lemma. 


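In dealing with the Fourier transform it is convenient for indices to run from 0 to n − 1
rather than from 1 to n. Let $x_0, x_1, \ldots, x_{n-1}$ be a sequence and let $f_0, f_1, \ldots, f_{n-1}$ be its
discrete Fourier transform. Let $i = \sqrt{-1}$. Then

\[ f_j = \frac{1}{\sqrt{n}} \sum_{k=0}^{n-1} x_k e^{-\frac{2\pi i}{n} jk}, \quad j = 0, \ldots, n-1. \]

In matrix form $f = \frac{1}{\sqrt{n}} Zx$ where $z_{jk} = e^{-\frac{2\pi i}{n} jk}$:

\[
\begin{pmatrix} f_0 \\ f_1 \\ \vdots \\ f_{n-1} \end{pmatrix}
= \frac{1}{\sqrt{n}}
\begin{pmatrix}
1 & 1 & 1 & \cdots & 1 \\
1 & e^{-\frac{2\pi i}{n}} & e^{-\frac{2\pi i}{n} 2} & \cdots & e^{-\frac{2\pi i}{n}(n-1)} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & e^{-\frac{2\pi i}{n}(n-1)} & e^{-\frac{2\pi i}{n} 2(n-1)} & \cdots & e^{-\frac{2\pi i}{n}(n-1)^2}
\end{pmatrix}
\begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_{n-1} \end{pmatrix}
\]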

If some of the elements of x are zero, delete the zero elements of x and the corresponding
columns of the matrix. To maintain a square matrix, let $n_x$ be the number of non-zero
elements in x and select $n_x$ consecutive rows of the matrix. Normalize the columns of the
resulting submatrix by dividing each element in a column by the column element in the
first row. The resulting submatrix is a Vandermonde matrix that looks like

\[
\begin{pmatrix}
1 & 1 & 1 & 1 \\
a & b & c & d \\
a^2 & b^2 & c^2 & d^2 \\
a^3 & b^3 & c^3 & d^3
\end{pmatrix}
\]

and is non-singular.

Lemma 10.5 If $x_0, x_1, \ldots, x_{n-1}$ has $n_x$ non-zero elements, then $f_0, f_1, \ldots, f_{n-1}$ cannot
have $n_x$ consecutive zeros.

Figure 10.5: The transform of the sequence 100100100.

Proof: Let $i_1, i_2, \ldots, i_{n_x}$ be the indices of the non-zero elements of x. Then the elements
of the Fourier transform in the range $k = m+1, m+2, \ldots, m+n_x$ are

\[ f_k = \frac{1}{\sqrt{n}} \sum_{j=1}^{n_x} x_{i_j} e^{-\frac{2\pi i}{n} k i_j}. \]

Note the use of i as $\sqrt{-1}$ and the multiplication of the exponent by $i_j$ to account for the
actual location of the element in the sequence. Normally, if every element in the sequence
was included, we would just multiply by the index of summation.

Convert the equation to matrix form by defining $z_{kj} = \frac{1}{\sqrt{n}} \exp(-\frac{2\pi i}{n} k i_j)$ and write
f = Zx where now x is the vector consisting of the non-zero elements of the original x
and f is the vector of the $n_x$ transform elements in the range above.
By its definition, x ≠ 0. To prove the lemma we need to show that f is non-zero. This will
be true provided Z is non-singular since $x = Z^{-1}f$. If we rescale Z by dividing each column
by its leading entry we get a Vandermonde matrix, which is non-singular. ■


Theorem 10.6 Let $n_x$ be the number of non-zero elements in x and let $n_f$ be the number
of non-zero elements in the Fourier transform of x. Let $n_x$ divide n. Then $n_x n_f \ge n$.


Proof: If x has $n_x$ non-zero elements, f cannot have a consecutive block of $n_x$ zeros. Since
$n_x$ divides n there are $\frac{n}{n_x}$ blocks each containing at least one non-zero element. Thus, the
product of the numbers of non-zero elements in x and f is at least n. ■


The Fourier transform of spikes proves that the above bound is tight

To show that the bound in Theorem 10.6 is tight we show that the Fourier transform
of the sequence of length n consisting of $\sqrt{n}$ ones, each one separated by $\sqrt{n} - 1$ zeros,
is the sequence itself. For example, the Fourier transform of the sequence 100100100 is
100100100. Thus, for this class of sequences, $n_x n_f = n$.


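Theorem 10.7 Let $S(\sqrt{n}, \sqrt{n})$ be the sequence of 1's and 0's with $\sqrt{n}$ 1's spaced $\sqrt{n}$
apart. The Fourier transform of $S(\sqrt{n}, \sqrt{n})$ is itself.

Proof: Consider the columns $0, \sqrt{n}, 2\sqrt{n}, \ldots, (\sqrt{n}-1)\sqrt{n}$. These are the columns for
which $S(\sqrt{n}, \sqrt{n})$ has value 1. The element of the matrix Z in the row $j\sqrt{n}$ of column
$k\sqrt{n}$, $0 \le k < \sqrt{n}$, is $e^{-\frac{2\pi i}{n}(j\sqrt{n})(k\sqrt{n})} = e^{-2\pi i jk} = 1$. Thus, the product of these rows of Z
times the vector $S(\sqrt{n}, \sqrt{n})$ equals $\sqrt{n}$ and the $1/\sqrt{n}$ normalization yields $f_{j\sqrt{n}} = 1$.

For rows whose index is not of the form $j\sqrt{n}$, that is, a row b with $b \ne j\sqrt{n}$ for every
$j \in \{0, 1, \ldots, \sqrt{n}-1\}$, let $z = e^{-2\pi i/\sqrt{n}}$. The elements in row b in the columns
$0, \sqrt{n}, 2\sqrt{n}, \ldots, (\sqrt{n}-1)\sqrt{n}$ are $1, z^b, z^{2b}, \ldots, z^{(\sqrt{n}-1)b}$ and thus

\[ f_b = \frac{1}{\sqrt{n}}\left(1 + z^b + z^{2b} + \cdots + z^{(\sqrt{n}-1)b}\right) = \frac{1}{\sqrt{n}} \cdot \frac{z^{\sqrt{n}\,b} - 1}{z^b - 1} = 0 \]

since $z^{\sqrt{n}\,b} = 1$ and $z^b \ne 1$. ■
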
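
Both Lemma 10.5 and the tightness claim can be checked numerically on the sequence 100100100. The sketch below uses a direct $O(n^2)$ transform with the $1/\sqrt{n}$ normalization and a negative exponent (the sign convention is an assumption and does not affect the zero pattern). The helper names and the tolerance are illustrative.

```python
import cmath
import math

def dft(x):
    """Discrete Fourier transform with 1/sqrt(n) normalization."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            / math.sqrt(n) for j in range(n)]

def support_size(v, tol=1e-9):
    """Number of entries that are non-zero up to rounding error."""
    return sum(1 for t in v if abs(t) > tol)

def longest_zero_run(v, tol=1e-9):
    """Length of the longest block of consecutive zeros."""
    best = run = 0
    for t in v:
        run = run + 1 if abs(t) <= tol else 0
        best = max(best, run)
    return best

spikes = [1, 0, 0, 1, 0, 0, 1, 0, 0]
f = dft(spikes)
print([round(abs(t)) for t in f])                 # [1, 0, 0, 1, 0, 0, 1, 0, 0]
print(support_size(spikes) * support_size(f))     # 9, equal to n
print(longest_zero_run(f))                        # 2, fewer than n_x = 3
```

Here $n_x = n_f = 3$, so the product equals n = 9 exactly, and the transform never has three zeros in a row.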
10.5 Gradient 
The gradient of a function f(x) of d variables, $x = (x_1, x_2, \ldots, x_d)$, at a point $x_0$ is
denoted $\nabla f(x_0)$. It is a d-dimensional vector with components
$\frac{\partial f}{\partial x_1}(x_0), \frac{\partial f}{\partial x_2}(x_0), \ldots, \frac{\partial f}{\partial x_d}(x_0)$,
where $\frac{\partial f}{\partial x_i}$ are partial derivatives. Without explicitly stating, we assume that the
derivatives referred to exist. The rate of increase of the function f as we move from $x_0$ in a
direction u is $\nabla f(x_0) \cdot u$. So the direction of steepest descent is $-\nabla f(x_0)$; this is a
natural direction to move to minimize f. But by how much should we move? A large move
may overshoot the minimum. See Figure 10.6. A simple fix is to minimize f on the line
from $x_0$ in the direction of steepest descent by solving a one dimensional minimization
problem. This gives us the next iterate $x_1$ and we repeat. We do not discuss the issue
of step-size any further. Instead, we focus on infinitesimal gradient descent, where the
algorithm makes infinitesimal moves in the $-\nabla f(x_0)$ direction. Whenever $\nabla f$ is not the
zero vector, we strictly decrease the function in the direction $-\nabla f$, so the current point
is not a minimum of the function f. Conversely, a point x where $\nabla f = 0$ is called a
first-order local optimum of f. A first-order local optimum may be a local minimum, local
maximum, or a saddle point. We ignore saddle points since numerical error is likely to
prevent gradient descent from stopping at a saddle point. In general, local minima do not
have to be global minima, see Figure 10.6, and gradient descent may converge to a local
minimum that is not a global minimum. When the function f is convex, this is not the
case.
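
The line-minimization fix mentioned above has a closed form when f is a convex quadratic, which makes for a compact illustration. The sketch below assumes $f(x) = \frac{1}{2}x^TQx - b^Tx$ with Q symmetric positive definite, so that $\nabla f(x) = Qx - b$ and the minimizer along the steepest-descent ray is at step $\alpha = g^Tg / g^TQg$. The example matrix, tolerance, and function names are illustrative choices.

```python
def matvec(Q, x):
    return [sum(q * t for q, t in zip(row, x)) for row in Q]

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def steepest_descent(Q, b, x0, tol=1e-12, max_iter=1000):
    """Minimize f(x) = 0.5 x^T Q x - b^T x for symmetric positive definite Q."""
    x = list(x0)
    for _ in range(max_iter):
        g = [q - t for q, t in zip(matvec(Q, x), b)]   # gradient Qx - b
        if dot(g, g) ** 0.5 < tol:
            break                                      # first-order optimum reached
        # Exact minimizer of f along the ray x - alpha * g.
        alpha = dot(g, g) / dot(g, matvec(Q, g))
        x = [t - alpha * s for t, s in zip(x, g)]
    return x

Q = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]
x_min = steepest_descent(Q, b, [0.0, 0.0])
print(x_min)   # approaches (0.2, 0.4), the solution of Qx = b
```

Since this f is convex, the point where the gradient vanishes is the global minimum, in line with Theorem 10.8.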


A function f of a single variable x is said to be convex if for any two points a and b,
the line segment joining (a, f(a)) and (b, f(b)) lies on or above the curve f(·). A function
of many variables is convex if on any line segment in its domain, it acts as a convex
function of one variable on the line segment.


Definition 10.1 A function f over a convex domain is a convex function if for any two
points x and y in the domain, and any $\lambda$ in [0,1] we have

\[ f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda) f(y). \]

The function is concave if the inequality is satisfied with ≥ instead of ≤.

Figure 10.6: Gradient descent overshooting minimum


Theorem 10.8 Suppose f is a convex, differentiable function defined on a closed bounded 
convex domain. Then any first-order local minimum is also a global minimum. Infinites- 
imal gradient descent always reaches the global minimum. 


Proof: We will prove that if x is a local minimum, then it must be a global minimum.
If not, consider a global minimum point y ≠ x. On the line joining x and y, the function
must not go above the line joining f(x) and f(y). This means for an infinitesimal ε > 0,
moving distance ε from x towards y, the function must decrease, so $\nabla f(x)$ is not 0,
contradicting the assumption that x is a local minimum. ■


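The second derivatives $\frac{\partial^2 f}{\partial x_i \partial x_j}$ form a matrix, called the Hessian, denoted H(f(x)).
The Hessian of f at x is a symmetric d × d matrix with $ij$th entry $\frac{\partial^2 f}{\partial x_i \partial x_j}(x)$. The second
derivative of f at x in the direction u is the rate of change of the first derivative in the
direction u from x. It is easy to see that it equals

\[ u^T H(f(x))\, u. \]

To see this, note that the second derivative of f along the unit vector u is

\[
\sum_j u_j \frac{\partial}{\partial x_j}\bigl(\nabla f(x) \cdot u\bigr)
= \sum_j u_j \sum_i \frac{\partial}{\partial x_j}\Bigl(u_i \frac{\partial f(x)}{\partial x_i}\Bigr)
= \sum_{j,i} u_j u_i \frac{\partial^2 f}{\partial x_j \partial x_i}(x).
\]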








Theorem 10.9 Suppose f is a function from a closed convex domain D in $\mathbf{R}^d$ to the
reals and the Hessian of f exists everywhere in D. Then f is convex (concave) on D if 
and only if the Hessian of f is positive (negative) semi-definite everywhere on D. 
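
For functions of two variables the test in Theorem 10.9 reduces to checking the smaller eigenvalue of a 2 × 2 symmetric matrix, which has a closed form. The sketch below applies it to two quadratics whose Hessians are constant; the function names and the tolerance are illustrative.

```python
import math

def hessian_eigenvalues_2x2(h11, h12, h22):
    """Eigenvalues of the symmetric matrix [[h11, h12], [h12, h22]], smaller first."""
    mean = (h11 + h22) / 2.0
    radius = math.hypot((h11 - h22) / 2.0, h12)
    return mean - radius, mean + radius

def is_positive_semidefinite(h11, h12, h22, tol=1e-12):
    """True if the smaller eigenvalue is non-negative up to rounding."""
    return hessian_eigenvalues_2x2(h11, h12, h22)[0] >= -tol

# f(x, y) = x^2 + x*y + y^2 has constant Hessian [[2, 1], [1, 2]].
print(is_positive_semidefinite(2.0, 1.0, 2.0))    # True: f is convex
# g(x, y) = x^2 - y^2 has constant Hessian [[2, 0], [0, -2]].
print(is_positive_semidefinite(2.0, 0.0, -2.0))   # False: g is not convex
```

For a non-quadratic function the Hessian varies with x, and the check would have to hold at every point of the domain.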


Gradient descent requires the gradient to exist. But, even if the gradient is not always 


defined, one can minimize a convex function over a convex domain efficiently, i.e., in 
polynomial time. Technically, one can only find an approximate minimum and the time
depends on the error parameter as well as the presentation of the convex set. We do not
go into these details. But, in principle we can minimize a convex function over a convex
domain. We can also maximize a concave function over a convex domain. However, in
general, we do not have efficient procedures to maximize a convex function over a convex 
domain. It is easy to see that at a first-order local minimum of a possibly non-convex 
function, the gradient vanishes. But second-order local decrease of the function may be 
possible. The steepest second-order decrease is in the direction of ±v, where v is the
eigenvector of the Hessian corresponding to the most negative eigenvalue.


10.6 Linear Programming 


Linear programming is an optimization problem that has been carefully studied and is
immensely useful. We consider the linear programming problem in the following form where
A is an m × n matrix, m ≤ n, of rank m, c is 1 × n, b is m × 1, and x is n × 1:


max c · x subject to Ax = b, x ≥ 0.


Inequality constraints can be converted to this form by adding slack variables. Also, we
can do Gaussian elimination on A and if it does not have rank m, we either find that
the system of equations has no solution, whence we may stop, or we can find and discard
redundant equations. After this preprocessing, we may assume that the rows of A are
independent.


The simplex algorithm is a classical method to solve linear programming problems. It 
is a vast subject and is well discussed in many texts. Here, we will discuss the ellipsoid 
algorithm which is in a sense based more on continuous mathematics and is closer to the 
spirit of this book. 


10.6.1 The Ellipsoid Algorithm 


The first polynomial time algorithm for linear programming$^{45}$ was developed by Khachiyan
based on work of Iudin, Nemirovsky and Shor and is called the ellipsoid algorithm. The
algorithm is best stated for the seemingly simpler problem of determining whether there
is a solution to Ax ≤ b and if so finding one. The ellipsoid algorithm starts with a large
ball in d-space which is guaranteed to contain the polyhedron Ax ≤ b. Even though we
do not yet know if the polyhedron is empty or non-empty, such a ball can be found. The
algorithm checks if the center of the ball is in the polyhedron; if it is, we have achieved our
objective. If not, we know from the Separating Hyperplane Theorem of convex geometry
that there is a hyperplane, called the separating hyperplane, through the center of the ball
such that the whole polytope lies in one of the half spaces.


45 Although there are examples where the simplex algorithm requires exponential time, it was shown
by Shanghua Teng and Dan Spielman that the expected running time of the simplex algorithm on an
instance produced by taking an arbitrary instance and then adding small Gaussian perturbations to it is
polynomial.


Figure 10.7: Ellipsoid Algorithm (labels: ellipsoid containing half-sphere, polytope, separating hyperplane)


We then find an ellipsoid which contains the ball intersected with this half-space. See
Figure 10.7. The ellipsoid is guaranteed to contain Ax ≤ b as was the ball earlier. If the
center of the ellipsoid does not satisfy the inequalities, then again there is a separating
hyperplane and we repeat the process. After a suitable number of steps, either we find a
solution to the original Ax ≤ b or we end up with a very small ellipsoid. If the original A
and b had integer entries, one can ensure that the set Ax ≤ b, after a slight perturbation
which preserves its emptiness/non-emptiness, has a volume of at least some ε > 0. If our
ellipsoid has shrunk to a volume of less than this ε, then there is no solution. Clearly
this must happen within $\log_{1/(1-\rho)}(V_0/\epsilon) = O(d \log(V_0/\epsilon))$ steps, where $V_0$ is an upper bound on the
initial volume and $1 - \rho$ is the factor by which the volume shrinks in each step. We do not
go into details of how to get a value for $V_0$, but the important points are that (i) only the
logarithm of $V_0$ appears in the bound on the number of steps, and (ii) the dependence on
d is linear. These features ensure a polynomial time algorithm.
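
The loop described above can be sketched for small dense systems. The text does not give the formulas for the new ellipsoid, so the sketch uses the standard central-cut update for an ellipsoid $\{y : (y-x)^TP^{-1}(y-x) \le 1\}$, which is an assumption here rather than something derived in this section. The function name, the starting radius, and the step limit are illustrative, and the code requires d ≥ 2.

```python
import math

def ellipsoid_feasibility(A, b, radius, max_steps=1000):
    """Look for x with A x <= b, assuming any solution lies in the ball |x| <= radius.

    The current ellipsoid is {y : (y - x)^T P^{-1} (y - x) <= 1}. Requires d >= 2.
    Returns a feasible point, or None if none is found within max_steps.
    """
    d = len(A[0])
    x = [0.0] * d
    P = [[radius ** 2 if i == j else 0.0 for j in range(d)] for i in range(d)]
    for _ in range(max_steps):
        violated = next((row for row, rhs in zip(A, b)
                         if sum(r * t for r, t in zip(row, x)) > rhs), None)
        if violated is None:
            return x                      # the center satisfies every inequality
        # Cut through the center: keep the half {y : violated . (y - x) <= 0}.
        Pa = [sum(P[i][j] * violated[j] for j in range(d)) for i in range(d)]
        scale = math.sqrt(sum(violated[i] * Pa[i] for i in range(d)))
        g = [t / scale for t in Pa]
        x = [x[i] - g[i] / (d + 1) for i in range(d)]
        factor = d * d / (d * d - 1.0)
        P = [[factor * (P[i][j] - 2.0 / (d + 1) * g[i] * g[j]) for j in range(d)]
             for i in range(d)]
    return None

A = [[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]]   # x >= 1, y >= 1, x + y <= 3
b = [-1.0, -1.0, 3.0]
point = ellipsoid_feasibility(A, b, radius=10.0)
print(point)
```

A production implementation would also stop once the volume drops below the threshold ε and report infeasibility; here the step limit plays that role.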


The main difficulty in proving fast convergence is to show that the volume of the
ellipsoid shrinks by a certain factor in each step. Thus, the question can be phrased as:
suppose E is an ellipsoid with center $x_0$ and consider the half-ellipsoid E′ defined by

\[ E' = \{x \mid x \in E,\ a \cdot (x - x_0) \ge 0\} \]

where a is some unit length vector. Let $\hat{E}$ be the smallest volume ellipsoid containing E′.
Show that

\[ \frac{\mathrm{Vol}(\hat{E})}{\mathrm{Vol}(E)} \le 1 - \rho \]

for some ρ > 0. A sequence of geometric reductions transforms this into a simple problem.
Translate and then rotate the coordinate system so that $x_0 = 0$ and a = (1, 0, 0, ..., 0).
Finally, apply a non-singular linear transformation τ so that $\tau E = B = \{x \mid |x| \le 1\}$, the
unit sphere. The important point is that a non-singular linear transformation τ multiplies
the volumes of all sets by |det(τ)|, so that $\frac{\mathrm{Vol}(\hat{E})}{\mathrm{Vol}(E)} = \frac{\mathrm{Vol}(\tau(\hat{E}))}{\mathrm{Vol}(\tau(E))}$. The following lemma answers
the question raised.





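Lemma 10.10 Consider the half-sphere $B' = \{x \mid x_1 \ge 0,\ |x| \le 1\}$. The following
ellipsoid $\hat{E}$ contains B′:

\[
\hat{E} = \left\{ x \;\middle|\; \left(\frac{d+1}{d}\right)^2 \left(x_1 - \frac{1}{d+1}\right)^2
+ \left(\frac{d^2-1}{d^2}\right)\left(x_2^2 + x_3^2 + \cdots + x_d^2\right) \le 1 \right\}.
\]

Further,

\[
\frac{\mathrm{Vol}(\hat{E})}{\mathrm{Vol}(B)} = \left(\frac{d}{d+1}\right)\left(\frac{d^2}{d^2-1}\right)^{(d-1)/2} \le 1 - \frac{1}{4d}.
\]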


The proof is left as an exercise (Exercise 10.27). 
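
The volume ratio in Lemma 10.10 and its bound are easy to tabulate. The sketch below evaluates the closed form for a few dimensions and converts it into a step count for a required volume reduction; the function names are illustrative, and the formula needs d ≥ 2.

```python
import math

def volume_ratio(d):
    """Vol(E_hat) / Vol(B) from Lemma 10.10, for dimension d >= 2."""
    return (d / (d + 1)) * (d * d / (d * d - 1.0)) ** ((d - 1) / 2.0)

def steps_needed(d, log_volume_ratio):
    """Steps until the volume has shrunk by a factor exp(-log_volume_ratio)."""
    return math.ceil(log_volume_ratio / -math.log(volume_ratio(d)))

for d in (2, 5, 10, 100):
    print(d, volume_ratio(d), 1 - 1 / (4 * d))

print(steps_needed(2, math.log(2)))   # 3 steps to halve the volume when d = 2
```

Because the ratio behaves like $1 - \frac{1}{2d}$ for large d, the number of steps grows linearly in d and only logarithmically in $V_0/\epsilon$.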














10.7 Integer Optimization 


The problem of maximizing a linear function subject to linear inequality constraints, 
but with the variables constrained to be integers is called integer programming. 


Max c · x subject to Ax ≤ b with $x_i$ integers


This problem is NP-hard. One way to handle the hardness is to relax the integer con- 
straints, solve the linear program in polynomial time, and round the fractional values to 
integers. The simplest rounding, round each variable which is 1/2 or more to 1, the rest 
to 0, yields sensible results in some cases. The vertex cover problem is one of them. The 
problem is to choose a subset of vertices so that each edge is covered with at least one of 
its end points in the subset. The integer program is: 


Min $\sum_i x_i$ subject to $x_i + x_j \ge 1$ ∀ edges (i, j); $x_i$ non-negative integers.


Solve the linear program. At least one variable for each edge must be at least 1/2 and 
the simple rounding converts it to one. The integer solution is still feasible. It clearly 
at most doubles the objective function from the linear programming solution and since 
the LP solution value is at most the optimal integer programming solution value, we are 
within a factor of two of the optimal. 
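
The rounding step can be illustrated on a triangle. The sketch below does not solve the linear program; it takes a fractional solution as input, and the point (1/2, 1/2, 1/2) used here is supplied by hand (it is optimal for the triangle, since adding the three edge constraints gives $2\sum_i x_i \ge 3$). The function name is illustrative.

```python
def round_vertex_cover(fractional, edges):
    """Round an LP solution: values of at least 1/2 go to 1, the rest to 0."""
    rounded = {v: 1 if value >= 0.5 else 0 for v, value in fractional.items()}
    # Every edge has an endpoint with value >= 1/2, so the result is a cover.
    assert all(rounded[i] + rounded[j] >= 1 for i, j in edges), "input was not LP-feasible"
    return rounded

# Triangle graph with a hand-supplied fractional solution (not from an LP solver).
edges = [(0, 1), (1, 2), (0, 2)]
fractional = {0: 0.5, 1: 0.5, 2: 0.5}
cover = round_vertex_cover(fractional, edges)
lp_value = sum(fractional.values())      # 1.5
cover_size = sum(cover.values())         # 3, at most 2 * lp_value
print(cover, lp_value, cover_size)
```

The optimal integer cover of the triangle has size 2, so the rounded cover of size 3 is within the factor of two guaranteed by the argument above.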
10.8 Semi-Definite Programming


Semi-definite programs are special cases of convex programs. Recall that an n × n
matrix A is positive semi-definite if and only if A is symmetric and for all $x \in \mathbf{R}^n$, $x^TAx \ge 0$.
There are many equivalent characterizations of positive semi-definite matrices. We
mention one. A symmetric matrix A is positive semi-definite if and only if it can be expressed
as $A = BB^T$ for a possibly rectangular matrix B.





A semi-definite program (SDP) is the problem of minimizing a linear function $c^Tx$
subject to a constraint that $F = F_0 + F_1x_1 + F_2x_2 + \cdots + F_dx_d$ is positive semi-definite.
Here $F_0, F_1, \ldots, F_d$ are given symmetric matrices.


This is a convex program since the set of x satisfying the constraint is a convex
set. To see this, note that if $F(x) = F_0 + F_1x_1 + F_2x_2 + \cdots + F_dx_d$ and $F(y) =
F_0 + F_1y_1 + F_2y_2 + \cdots + F_dy_d$ are positive semi-definite, then so is $F(\alpha x + (1-\alpha)y)$
for 0 ≤ α ≤ 1. In principle, SDP’s can be solved in polynomial time. It turns out
that there are more efficient algorithms for SDP’s than general convex programs and that
many interesting problems can be formulated as SDP’s. We discuss the latter aspect here.








Linear programs are special cases of SDP’s. For any vector v, let diag(v) denote a
diagonal matrix with the components of v on the diagonal. Then it is easy to see that
the constraints v ≥ 0 are equivalent to the constraint diag(v) is positive semi-definite.
Consider the linear program:


Minimize $c^Tx$ subject to Ax = b; x ≥ 0.


Rewrite Ax = b as Ax − b ≥ 0 and b − Ax ≥ 0 and use the idea of diagonal matrices
above to formulate this as an SDP.
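
One way to write out the resulting SDP explicitly is shown below. The symbols $a^{(i)}$ (the $i$th column of A) and $e_i$ (the $i$th standard basis vector) are notation introduced here for illustration; they do not appear in the text.

```latex
F(x) = \operatorname{diag}\bigl(Ax - b,\; b - Ax,\; x\bigr)
     = F_0 + \sum_{i=1}^{n} x_i F_i,
\qquad
F_0 = \operatorname{diag}(-b,\; b,\; 0),
\quad
F_i = \operatorname{diag}\bigl(a^{(i)},\; -a^{(i)},\; e_i\bigr).
```

A diagonal matrix is positive semi-definite exactly when its diagonal entries are non-negative, so $F(x)$ is positive semi-definite if and only if Ax = b and x ≥ 0, and minimizing $c^Tx$ subject to this constraint is the original linear program.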


A second interesting example is that of quadratic programs of the form:

\[ \text{Minimize } \frac{(c^Tx)^2}{d^Tx} \text{ subject to } Ax + b \ge 0. \]

This is equivalent to

\[ \text{Minimize } t \text{ subject to } Ax + b \ge 0 \text{ and } t \ge \frac{(c^Tx)^2}{d^Tx}. \]

This is in turn equivalent to the SDP


Minimize t subject to the following matrix being positive semi-definite:

\[
\begin{pmatrix}
\operatorname{diag}(Ax + b) & 0 & 0 \\
0 & t & c^Tx \\
0 & c^Tx & d^Tx
\end{pmatrix}
\]


Application to approximation algorithms.


An exciting area of application of SDP is in finding near-optimal solutions to some
integer problems. The central idea is best illustrated by its early application in a
breakthrough due to Goemans and Williamson [GW95] for the maximum cut problem which,
given a graph G(V, E), asks for the cut $S, \bar{S}$ maximizing the number of edges going across
the cut from S to $\bar{S}$. For each i ∈ V, let $x_i$ be an integer variable assuming values ±1
depending on whether i ∈ S or i ∈ $\bar{S}$ respectively. Then the max-cut problem can be
posed as

\[ \text{Maximize } \sum_{(i,j) \in E} (1 - x_i x_j) \text{ subject to the constraints } x_i \in \{-1, +1\}. \]

The integrality constraint on the $x_i$ makes the problem NP-hard. Instead replace the
integer constraints by allowing the $x_i$ to be unit length vectors. This enlarges the set of
feasible solutions since ±1 are just 1-dimensional vectors of length 1. The relaxed problem
is an SDP and can be solved in polynomial time. To see that it is an SDP, consider $x_i$ as
the rows of a matrix X. The variables of our SDP are not X, but actually $Y = XX^T$,
which is a positive semi-definite matrix. The SDP is

\[ \text{Maximize } \sum_{(i,j) \in E} (1 - y_{ij}) \text{ subject to } Y \text{ positive semi-definite and } y_{ii} = 1 \text{ for all } i, \]

where the condition $y_{ii} = 1$ expresses that each $x_i$ has unit length,


which can be solved in polynomial time. From the solution $Y$, find $X$ satisfying $Y = XX^T$.
Now, instead of a $\pm 1$ label on each vertex, we have vector labels, namely the rows of $X$.
We need to round the vectors to $\pm 1$ to get an $S$. One natural way to do this is to pick
a random vector $v$ and if for vertex $i$, $x_i \cdot v$ is positive, put $i$ in $S$, otherwise put it in
$\bar{S}$. Goemans and Williamson showed that this method produces a cut whose expected size is
guaranteed to be at least 0.878 times the maximum. The 0.878 factor is a big improvement on
the previous best factor of 0.5, which is easy to get by putting each vertex into $S$ with
probability 1/2.

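
The rounding step can be sketched in a few lines. This is only an illustration of the random-hyperplane rounding described above; it assumes the unit vectors (the rows of $X$) have already been obtained from an SDP solver, which is not shown, and the helper names `hyperplane_round` and `cut_size` are hypothetical.

```python
import random


def hyperplane_round(vectors, rng):
    """Label each vertex +1 or -1 by the sign of x_i . v for a random Gaussian vector v."""
    dim = len(vectors[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    labels = []
    for x in vectors:
        dot = sum(a * b for a, b in zip(x, v))
        labels.append(1 if dot > 0 else -1)
    return labels


def cut_size(edges, labels):
    """Count the edges whose endpoints received different labels."""
    return sum(1 for i, j in edges if labels[i] != labels[j])


# Toy instance: a path on three vertices whose (assumed) SDP vectors are antipodal
# across every edge, so any hyperplane through the origin separates the endpoints.
edges = [(0, 1), (1, 2)]
vectors = [(1.0, 0.0), (-1.0, 0.0), (1.0, 0.0)]
labels = hyperplane_round(vectors, random.Random(0))
print(cut_size(edges, labels))
```

On general graphs the vectors are not antipodal, and the rounding is typically repeated with several random vectors, keeping the best cut found.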
Application to machine learning. 


As discussed in Chapter 5, kernel functions are a powerful tool in machine learning. 
They allow one to apply algorithms that learn linear classifiers, such as Perceptron and 
Support Vector Machines, to problems where the positive and negative examples might 
have a more complicated separating curve. 


More specifically, a kernel $K$ is a function from pairs of examples to reals such that
for some implicit function $\phi$ from examples to $R^N$, we have $K(a, a') = \phi(a)^T \phi(a')$. (We
are using "$a$" and "$a'$" to refer to examples, rather than $x$ and $x'$, in order to not conflict
with the notation used earlier in this chapter.) Notice that this means that for any set of
examples $\{a_1, a_2, \ldots, a_n\}$, the matrix $A$ whose $ij$ entry equals $K(a_i, a_j)$ is positive
semi-definite. Specifically, $A = BB^T$ where the $i$th row of $B$ equals $\phi(a_i)$.
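
As a small illustration of the factorization $A = BB^T$, the sketch below uses the quadratic kernel $K(a, a') = (1 + aa')^2$ on one-dimensional examples, whose feature map is $\phi(a) = (1, \sqrt{2}a, a^2)$. The choice of kernel and the example values are assumptions made for illustration and are not taken from the text.

```python
import math
import random


def kernel(a, b):
    """Quadratic kernel on real-valued examples."""
    return (1.0 + a * b) ** 2


def phi(a):
    """Explicit feature map satisfying kernel(a, b) = phi(a) . phi(b)."""
    return (1.0, math.sqrt(2.0) * a, a * a)


def gram(examples):
    """Matrix A with A[i][j] = K(a_i, a_j)."""
    return [[kernel(a, b) for b in examples] for a in examples]


def quadratic_form(A, z):
    """Compute z^T A z."""
    n = len(z)
    return sum(z[i] * A[i][j] * z[j] for i in range(n) for j in range(n))


examples = [0.5, -1.0, 2.0]
A = gram(examples)
B = [phi(a) for a in examples]
BBt = [[sum(u * w for u, w in zip(B[i], B[j])) for j in range(3)] for i in range(3)]

# z^T A z = |B^T z|^2 >= 0 for every z, so random probes should never be negative
# (up to floating-point rounding).
rng = random.Random(1)
probes = [quadratic_form(A, [rng.uniform(-1, 1) for _ in range(3)]) for _ in range(100)]
print(min(probes) >= -1e-9)
```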


Given that a kernel corresponds to a positive semi-definite matrix, it is not surprising 
that there is a related use of semi-definite programming in machine learning. In particular, 
suppose that one does not want to specify up-front exactly which kernel an algorithm 
should use. In that case, a natural idea is instead to specify a space of kernel functions and 
allow the algorithm to select the best one from that space for the given data. Specifically, 
given some labeled training data and some unlabeled test data, one could solve for the 
matrix A over the combined data set that is positive semi-definite (so that it is a legal 
kernel function) and optimizes some given objective. This objective might correspond 
to separating the positive and negative examples in the labeled data while keeping the 
kernel simple so that it does not over-fit. If this objective is linear in the coefficients of 
A along with possibly additional linear constraints on A, then this is an SDP. This is the 
high-level idea of kernel learning, first proposed in [LCB+04].
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Arrow’s impossibility theorem, stating that any ranking of three or more items satisfying 
unanimity and independence of irrelevant alternatives must be a dictatorship, is from 
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Gibbard [Gib73] and Satterthwaite [Sat75]. A good discussion of issues in social choice 
appears in [Lis13]. The results presented in Section 10.2.2 on compressed sensing are due 
to Donoho and Elad [DE03] and Gribonval and Nielsen [GN03]. See [Don06] for more 
details on issues in compressed sensing. The ellipsoid algorithm for linear programming is 
due to Khachiyan [Kha79] based on work of Shor [Sho70] and Iudin and Nemirovski [IN77]. 
For more information on the ellipsoid algorithm and on semi-definite programming, see 
the book of Grötschel, Lovász, and Schrijver [GLS12]. The use of SDPs for approximating
the max-cut problem is due to Goemans and Williamson [GW95], and the use of SDPs for
learning a kernel function is due to [LCB+04].

10.10 Exercises


Exercise 10.1 Select a method that you believe is good for combining individual rankings
into a global ranking. Consider a set of rankings where each individual ranks b last. One
by one move b from the bottom to the top leaving the other rankings in place. Does there
exist a v as in Theorem 10.2 where v is the ranking that causes b to move from the bottom
to the top in the global ranking? If not, does your method of combining individual rankings
satisfy the axioms of unanimity and independence of irrelevant alternatives?


Exercise 10.2 Show that for the three axioms: non-dictator, unanimity, and independence
of irrelevant alternatives, it is possible to satisfy any two of the three.


Exercise 10.3 Does the axiom of independence of irrelevant alternatives make sense?
What if there were three rankings of five items? In the first two rankings, A is number one
and B is number two. In the third ranking, B is number one and A is number five. One
might compute an average score where a low score is good. A gets a score of 1+1+5=7 
and B gets a score of 2+2+1=5 and B is ranked number one in the global ranking. Now if 
the third ranker moves A up to the second position, A’s score becomes 1+1+2=4 and the 
global ranking of A and B changes even though no individual ranking of A and B changed. 
Is there some alternative axiom to replace independence of irrelevant alternatives? Write 
a paragraph on your thoughts on this issue. 


Exercise 10.4 Prove that in the proof of Theorem 10.2, the global ranking agrees with 
column v even if item b is moved down through the column. 


Exercise 10.5 Let A be an m by n matrix with elements from a zero mean, unit variance
Gaussian. How large must n be for there to be two or more sparse solutions to Ax = b
with high probability? You will need to define how small s should be for a solution with at
most s non-zero elements to be sparse.


Exercise 10.6 Section 10.2.1 showed that if A is an n × d matrix with entries selected
at random from a standard Gaussian, and n > 2s, then with probability one there will be
a unique s-sparse solution to Ax = b. Show that if n < s, then with probability one there
will not be a unique s-sparse solution. Assume d > s.


Exercise 10.7 Section 10.2.2 used the fact that $n = O(s^2 \log d)$ rows is sufficient so
that if each column of A is a random unit-length n-dimensional vector, then with high
probability all pairwise dot-products of columns will have magnitude less than $\frac{1}{2s}$. Here,
we show that $n = \Omega(\log d)$ rows is necessary as well. To make the notation less confusing
for this argument, we will use "m" instead of "d".

Specifically, prove that for $m > 3^n$, it is not possible to have m unit-length n-dimensional
vectors such that all pairwise dot-products of those vectors are less than $\frac{1}{2}$.

Some hints: (1) note that if two unit-length vectors u and v have dot-product greater
than or equal to $\frac{1}{2}$ then $|u - v| \le 1$ (if their dot-product is equal to $\frac{1}{2}$ then u, v, and the
origin form an equilateral triangle). So, it is enough to prove that $m > 3^n$ unit-length
vectors in $R^n$ cannot all have distance at least 1 from each other. (2) use the fact that the
volume of a ball of radius r in $R^n$ is proportional to $r^n$.

Exercise 10.8 Create a random 100 by 100 orthonormal matrix A and a sparse 100-dimensional
vector x. Compute Ax = b. Randomly select a few coordinates of b and
reconstruct x from the samples of b using the minimization of 1-norm technique of Section 
10.2.2. Did you get x back? 


Exercise 10.9 Let $A$ be a low rank $n \times m$ matrix. Let $r$ be the rank of $A$. Let $\tilde{A}$ be $A$
corrupted by Gaussian noise. Prove that the rank $r$ SVD approximation to $\tilde{A}$ minimizes
$\left\| A - \tilde{A} \right\|_F^2$.

Exercise 10.10 Prove that minimizing $\|x\|_0$ subject to Ax = b is NP-complete.


Exercise 10.11 When one wants to minimize $\|x\|_0$ subject to some constraint the problem
is often NP-hard and one uses the 1-norm as a proxy for the 0-norm. To get an
insight into this issue consider minimizing $\|x\|_0$ subject to the constraint that x lies in a
convex region. For simplicity assume the convex region is a sphere whose center is farther
from the origin than the radius of the sphere. Explore the sparsity of the solution when
minimizing the 1-norm over x in the spherical region as a function of the location of the center.


Exercise 10.12 Express the matrix 


2 17 2 2 2 
E ee 
2 A 
20 e ANA 
I3 -2-2 27 2 


as the sum of a low rank matrix plus a sparse matrix. To simplify the computation assume
you want the low rank matrix to be symmetric so that its singular value decomposition
will be $V \Sigma V^T$.

Exercise 10.13 Generate 100 × 100 matrices of rank 20, 40, 60, 80, and 100. In each
matrix randomly delete 50, 100, 200, or 400 entries. In each case try to recover the
original matrix. How well do you do?


Exercise 10.14 Repeat the previous exercise but instead of deleting elements, corrupt the 
elements by adding a reasonable size corruption to the randomly selected matrix entries. 


Exercise 10.15 Compute the Fourier transform of the sequence 1000010000. 
Exercise 10.16 What is the Fourier transform of a Gaussian? 


Exercise 10.17 What is the Fourier transform of a cyclic shift of a sequence? 

Exercise 10.18 Let S(i, j) be the sequence of i blocks each of length j where each block
of symbols is a 1 followed by j − 1 0's. The number n = 6 is factorable but not a perfect
square. What is the Fourier transform of S(2, 3) = 100100?


Exercise 10.19 Let $z$ be the $n$th root of unity. Prove that $\{z^{bi} \mid 0 \le i < n\} = \{z^i \mid 0 \le i < n\}$
provided that b does not divide n.


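Exercise 10.20 Show that if the elements in the second row of the n × n Vandermonde
matrix

$$\begin{pmatrix} 1 & 1 & \cdots & 1 \\ a & b & \cdots & c \\ a^2 & b^2 & \cdots & c^2 \\ \vdots & \vdots & & \vdots \\ a^{n-1} & b^{n-1} & \cdots & c^{n-1} \end{pmatrix}$$

are distinct, then the Vandermonde matrix is non-singular by expressing the determinant
of the matrix as an $n - 1$ degree polynomial in a.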


Exercise 10.21 Show that the following two statements are equivalent. 


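1. If the elements in the second row of the n × n Vandermonde matrix

$$\begin{pmatrix} 1 & 1 & \cdots & 1 \\ a & b & \cdots & c \\ a^2 & b^2 & \cdots & c^2 \\ \vdots & \vdots & & \vdots \\ a^{n-1} & b^{n-1} & \cdots & c^{n-1} \end{pmatrix}$$

are distinct, then the Vandermonde matrix is non-singular.


2. Specifying the value of an $n$th degree polynomial at $n + 1$ points uniquely determines
the polynomial.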


Exercise 10.22 Many problems can be formulated as finding x satisfying Ax = b where
A has more columns than rows and there is a subspace of solutions. If one knows that the
solution is sparse but some error in the measurement b may prevent finding the sparse
solution, they might add some residual error to b and reformulate the problem as solving
for x and r subject to Ax = b + r where r is the residual error. Discuss the advantages
and disadvantages of each of the following three versions of the problem.


1. Set $r = 0$ and find $x = \arg\min \|x\|_1$ satisfying $Ax = b$

2. Lasso: find $x = \arg\min \left( \|x\|_1 + \alpha \|r\|_2^2 \right)$ satisfying $Ax = b + r$

3. find $x = \arg\min \|x\|_1$ such that $\|r\|_2^2 < \varepsilon$

Exercise 10.23 Let M = L + R where L is a low rank matrix corrupted by a sparse noise
matrix R. Why can we not recover L from M if R is low rank or if L is sparse?


Exercise 10.24 


1. Suppose for a univariate convex function $f$ and a finite interval $D$, $|f''(x)| \le \delta |f'(x)|$
for every x. Then, what is a good step size to choose for gradient descent? Derive a 
bound on the number of steps needed to get an approximate minimum of f in terms 
of as few parameters as possible. 


2. Generalize the statement and proof to convex functions of d variables. 


Exercise 10.25 Prove that the maximum of a convex function over a polytope is attained 
at one of its vertices. 


Exercise 10.26 Create a convex function and a convex region where the maximization
problem has local maxima.


Exercise 10.27 Prove Lemma 10.10. 


Exercise 10.28 Consider the following symmetric matrix A:

$$A = \begin{pmatrix} 1 & 0 & 1 & 1 \\ 0 & 1 & 1 & -1 \\ 1 & 1 & 2 & 0 \\ 1 & -1 & 0 & 2 \end{pmatrix}$$

Find four vectors $v_1, v_2, v_3, v_4$ such that $a_{ij} = v_i^T v_j$ for all $1 \le i, j \le 4$. Also, find a
matrix $B$ such that $A = BB^T$.


Exercise 10.29 Prove that if $A_1$ and $A_2$ are positive semi-definite matrices, then so is
$A_1 + A_2$.


11 Wavelets


Given a vector space of functions, one would like an orthonormal set of basis functions 
that span the space. The Fourier transform provides a set of basis functions based on 
sines and cosines. Often we are dealing with functions that have finite support in which 
case we would like the basis vectors to have finite support. Also we would like to have 
an efficient algorithm for computing the coefficients of the expansion of a function in the 
basis. 


11.1 Dilation 


We begin our development of wavelets by first introducing dilation. A dilation is a 
mapping that scales all distances by the same factor. 




A dilation equation is an equation where a function is defined in terms of a linear
combination of scaled, shifted versions of itself. For instance,

$$f(x) = \sum_{k=0}^{d-1} c_k f(2x - k).$$


An example of this is $f(x) = f(2x) + f(2x - 1)$ which has a solution $f(x)$ equal to one
for $0 \le x < 1$ and is zero elsewhere. The equation is illustrated in the figure below. The
solid rectangle is $f(x)$ and the dotted rectangles are $f(2x)$ and $f(2x - 1)$.











Another example is $f(x) = \frac{1}{2} f(2x) + f(2x - 1) + \frac{1}{2} f(2x - 2)$. A solution is illustrated
in the figure below. The function $f(x)$ is indicated by solid lines. The functions $\frac{1}{2} f(2x)$,
$f(2x - 1)$, and $\frac{1}{2} f(2x - 2)$ are indicated by dotted lines.
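
Both example equations can be checked mechanically on a grid of points. The sketch below uses exact rational arithmetic and assumes that the solution of the second equation is the triangular "hat" function on $[0, 2]$ with peak value 1 at $x = 1$, which is consistent with the later discussion of triangular solutions. The helper names are illustrative.

```python
from fractions import Fraction


def box(x):
    """Candidate solution of f(x) = f(2x) + f(2x - 1): one on [0, 1), zero elsewhere."""
    return Fraction(1) if 0 <= x < 1 else Fraction(0)


def hat(x):
    """Assumed solution of f(x) = f(2x)/2 + f(2x - 1) + f(2x - 2)/2: a triangle on [0, 2]."""
    if 0 <= x <= 1:
        return Fraction(x)
    if 1 < x <= 2:
        return 2 - Fraction(x)
    return Fraction(0)


def satisfies(f, coeffs, points):
    """Check f(x) == sum_k c_k f(2x - k) exactly at every given point."""
    return all(f(x) == sum(c * f(2 * x - k) for k, c in enumerate(coeffs))
               for x in points)


grid = [Fraction(i, 64) for i in range(-64, 193)]  # points in [-1, 3]
box_ok = satisfies(box, [1, 1], grid)
hat_ok = satisfies(hat, [Fraction(1, 2), 1, Fraction(1, 2)], grid)
wrong_ok = satisfies(box, [1, 2], grid)  # deliberately wrong coefficients
print(box_ok, hat_ok, wrong_ok)
```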














If a dilation equation is of the form $\sum_{k=0}^{d-1} c_k f(2x - k)$ then we say that all dilations in the
equation are factor of two reductions.


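Lemma 11.1 If a dilation equation in which all the dilations are a factor of two reduction
has a solution, then either the coefficients on the right hand side of the equation sum to
two or the integral $\int_{-\infty}^{\infty} f(x)\,dx$ of the solution is zero.


Proof: Integrate both sides of the dilation equation from $-\infty$ to $+\infty$.

$$\int_{-\infty}^{\infty} f(x)\,dx = \int_{-\infty}^{\infty} \sum_{k=0}^{d-1} c_k f(2x - k)\,dx = \sum_{k=0}^{d-1} c_k \int_{-\infty}^{\infty} f(2x - k)\,dx = \sum_{k=0}^{d-1} c_k \int_{-\infty}^{\infty} f(2x)\,dx = \frac{1}{2} \sum_{k=0}^{d-1} c_k \int_{-\infty}^{\infty} f(x)\,dx$$

If $\int_{-\infty}^{\infty} f(x)\,dx \ne 0$, then dividing both sides by $\int_{-\infty}^{\infty} f(x)\,dx$ gives $\sum_{k=0}^{d-1} c_k = 2$. ∎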


The above proof interchanged the order of the summation and the integral. This is valid 
provided the 1-norm of the function is finite. Also note that there are non-zero solutions to 
dilation equations in which all dilations are a factor of two reduction where the coefficients 
do not sum to two such as 


$$f(x) = f(2x) + f(2x - 1) + f(2x - 2) + f(2x - 3)$$

$$f(x) = f(2x) + 2f(2x - 1) + 2f(2x - 2) + 2f(2x - 3) + f(2x - 4).$$

In these examples $f(x)$ takes on both positive and negative values and $\int_{-\infty}^{\infty} f(x)\,dx = 0$.


11.2 The Haar Wavelet 


Let $\phi(x)$ be a solution to the dilation equation $f(x) = f(2x) + f(2x - 1)$. The function
$\phi$ is called a scale function or scale vector and is used to generate the two dimensional
family of functions, $\phi_{jk}(x) = \phi(2^j x - k)$, where $j$ and $k$ are non-negative integers. Other
authors scale $\phi_{jk} = \phi(2^j x - k)$ by $2^{j/2}$ so that the 2-norm, $\int_{-\infty}^{\infty} \phi_{jk}^2(t)\,dt$, is 1. However, for
educational purposes, simplifying the notation for ease of understanding was preferred.


For a given value of $j$, the shifted versions, $\{\phi_{jk} \mid k \ge 0\}$, span a space $V_j$. The spaces
$V_0, V_1, V_2, \ldots$ are larger and larger spaces and allow better and better approximations to
a function. The fact that $\phi(x)$ is the solution of a dilation equation implies that for any
fixed $j$, $\phi_{jk}$ is a linear combination of the $\{\phi_{j+1,k'} \mid k' \ge 0\}$ and this ensures that $V_j \subseteq V_{j+1}$.
It is for this reason that it is desirable in designing a wavelet system for the scale function
to satisfy a dilation equation. For a given value of $j$, the shifted $\phi_{jk}$ are orthogonal in the
sense that $\int_x \phi_{jk}(x) \phi_{jl}(x)\,dx = 0$ for $k \ne l$.


Figure 11.1: Set of scale functions associated with the Haar wavelet. The surviving panel
labels are $\phi_{20}(x) = \phi(4x)$, $\phi_{21}(x) = \phi(4x - 1)$, $\phi_{22}(x) = \phi(4x - 2)$, and $\phi_{23}(x) = \phi(4x - 3)$.


Note that for each $j$, the set of functions $\phi_{jk}$, $k = 0, 1, 2, \ldots$, form a basis for a vector
space $V_j$ and are orthogonal. The set of basis vectors $\phi_{jk}$, for all $j$ and $k$, form an
over-complete basis and for different values of $j$ are not orthogonal. Since $\phi_{jk}$, $\phi_{j+1,2k}$, and
$\phi_{j+1,2k+1}$ are linearly dependent, for each value of $j$ delete $\phi_{j+1,k}$ for odd values of $k$ to
get a linearly independent set of basis vectors. To get an orthogonal set of basis vectors,
define

$$\psi_{jk}(x) = \begin{cases} 1 & \frac{2k}{2^j} \le x < \frac{2k+1}{2^j} \\ -1 & \frac{2k+1}{2^j} \le x < \frac{2k+2}{2^j} \\ 0 & \text{otherwise} \end{cases}$$


and replace $\phi_{j,2k}$ with $\psi_{j+1,2k}$. Basically, replace the three functions

$\phi(x)$, $\phi(2x)$, $\phi(2x - 1)$

by the two functions

$\phi(x)$, $\psi(x)$.


The Haar Wavelet

$$\phi(x) = \begin{cases} 1 & 0 \le x < 1 \\ 0 & \text{otherwise} \end{cases}$$

$$\psi(x) = \begin{cases} 1 & 0 \le x < \frac{1}{2} \\ -1 & \frac{1}{2} \le x < 1 \\ 0 & \text{otherwise} \end{cases}$$

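The basis set becomes

$\phi_{00}$ $\psi_{10}$
$\psi_{20}$ $\psi_{22}$
$\psi_{30}$ $\psi_{32}$ $\psi_{34}$ $\psi_{36}$
$\psi_{40}$ $\psi_{42}$ $\psi_{44}$ $\psi_{46}$ $\psi_{48}$ $\psi_{4,10}$ $\psi_{4,12}$ $\psi_{4,14}$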


To approximate a function that has only finite support, select a scale vector $\phi(x)$
whose scale is that of the support of the function to be represented. Next approximate
the function by the set of scale functions $\phi(2^j x - k)$, $k = 0, 1, \ldots$, for some fixed value of
$j$. The value of $j$ is determined by the desired accuracy of the approximation. Basically
the x axis has been divided into intervals of size $2^{-j}$ and in each interval the function is
approximated by a fixed value. It is this approximation of the function that is expressed
as a linear combination of the basis functions.


Once the value of $j$ has been selected, the function is sampled at $2^j$ points, one in
each interval of width $2^{-j}$. Let the sample values be $s_0, s_1, \ldots$. The approximation to the
function is $\sum_{k=0}^{2^j - 1} s_k \phi(2^j x - k)$ and is represented by the vector $(s_0, s_1, \ldots, s_{2^j - 1})$. The
problem now is to represent the approximation to the function using the basis vectors
rather than the non-orthogonal set of scale functions $\phi_{jk}(x)$. This is illustrated in the
following example.


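To represent the function corresponding to a vector such as $(3\ 1\ 4\ 8\ 3\ 5\ 7\ 9)$,
one needs to find the $c_i$ such that

$$\begin{pmatrix} 3 \\ 1 \\ 4 \\ 8 \\ 3 \\ 5 \\ 7 \\ 9 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 0 & 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & -1 & 0 & 0 & 0 \\ 1 & 1 & -1 & 0 & 0 & 1 & 0 & 0 \\ 1 & 1 & -1 & 0 & 0 & -1 & 0 & 0 \\ 1 & -1 & 0 & 1 & 0 & 0 & 1 & 0 \\ 1 & -1 & 0 & 1 & 0 & 0 & -1 & 0 \\ 1 & -1 & 0 & -1 & 0 & 0 & 0 & 1 \\ 1 & -1 & 0 & -1 & 0 & 0 & 0 & -1 \end{pmatrix} \begin{pmatrix} c_1 \\ c_2 \\ c_3 \\ c_4 \\ c_5 \\ c_6 \\ c_7 \\ c_8 \end{pmatrix}$$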


The first column represents the scale function $\phi(x)$ and subsequent columns the $\psi$'s.
The tree in Figure 11.2 illustrates an efficient way to find the coefficients representing
the vector $(3\ 1\ 4\ 8\ 3\ 5\ 7\ 9)$ in the basis. Each vertex in the tree contains the
average of the quantities of its two children. The root gives the average of the elements in 
the vector, which is 5 in this example. This average is the coefficient of the basis vector 
in the first column of the above matrix. The second basis vector converts the average 
of the eight elements into the average of the first four elements, which is 4, and the last 
four elements, which is 6, with a coefficient of -1. Working up the tree determines the 
coefficients for each basis vector. 
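
The averaging tree translates directly into a short routine. The sketch below is one way to organize the computation, assuming the input length is a power of two and the basis vectors are ordered as the columns of the matrix above (coarsest first); the function names are illustrative.

```python
def haar_coefficients(values):
    """Coefficients of `values` in the unnormalized Haar basis, coarsest first."""
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    detail_levels = []  # detail coefficients, finest level first
    current = list(values)
    while len(current) > 1:
        averages = [(current[i] + current[i + 1]) / 2
                    for i in range(0, len(current), 2)]
        # Each detail is (left child) - (average of the two children).
        details = [current[2 * i] - averages[i] for i in range(len(averages))]
        detail_levels.append(details)
        current = averages
    coeffs = [current[0]]  # root of the tree: the overall average
    for details in reversed(detail_levels):
        coeffs.extend(details)
    return coeffs


def haar_reconstruct(coeffs):
    """Invert haar_coefficients by walking down the tree from the root."""
    current = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(current)]
        pos += len(current)
        current = [v for avg, d in zip(current, details) for v in (avg + d, avg - d)]
    return current


c = haar_coefficients([3, 1, 4, 8, 3, 5, 7, 9])
print(c)
print(haar_reconstruct(c))
```

The first two entries are the overall average 5 and the coefficient −1 described in the text; the routine uses a number of arithmetic operations linear in the length of the vector.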





Figure 11.2: Tree of function averages

11.3 Wavelet Systems


So far we have explained wavelets using the simple-to-understand Haar wavelet. We
now consider general wavelet systems. A wavelet system is built from a basic scaling
function $\phi(x)$, which comes from a dilation equation. Scaling and shifting of the basic
scaling function gives a two dimensional set of scaling functions $\phi_{jk}$ where

$$\phi_{jk}(x) = \phi(2^j x - k).$$

For a fixed value of $j$, the $\phi_{jk}$ span a space $V_j$. If $\phi(x)$ satisfies a dilation equation

$$\phi(x) = \sum_{k=0}^{d-1} c_k \phi(2x - k),$$

then $\phi_{jk}$ is a linear combination of the $\phi_{j+1,k}$'s and this implies that $V_0 \subseteq V_1 \subseteq V_2 \subseteq V_3 \cdots$.


11.4 Solving the Dilation Equation 


Consider solving a dilation equation

$$\phi(x) = \sum_{k=0}^{d-1} c_k \phi(2x - k)$$

to obtain the scale function for a wavelet system. Perhaps the easiest way is to assume
a solution and then calculate the scale function by successive approximation as in the
following program for the Daubechies scale function:

$$\phi(x) = \frac{1+\sqrt{3}}{4} \phi(2x) + \frac{3+\sqrt{3}}{4} \phi(2x - 1) + \frac{3-\sqrt{3}}{4} \phi(2x - 2) + \frac{1-\sqrt{3}}{4} \phi(2x - 3).$$

The solution will actually be samples of $\phi(x)$ at some desired resolution.


Program Compute-Daubechies:


Insert the coefficients of the dilation equation.

$c_0 = \frac{1+\sqrt{3}}{4}$, $c_1 = \frac{3+\sqrt{3}}{4}$, $c_2 = \frac{3-\sqrt{3}}{4}$, $c_3 = \frac{1-\sqrt{3}}{4}$


Set the initial approximation to $\phi(x)$ by generating a vector whose components
approximate the samples of $\phi(x)$ at equally spaced values of $x$.


Execute the following loop until the values for $\phi(x)$ converge.


begin


Calculate $\phi(2x)$ by averaging successive values of $\phi(x)$ together. Fill
out the remaining half of the vector representing $\phi(2x)$ with zeros.

Calculate $\phi(2x - 1)$, $\phi(2x - 2)$, and $\phi(2x - 3)$ by shifting the contents
of $\phi(2x)$ the appropriate distance, discarding the zeros that move
off the right end and adding zeros at the left end.

Calculate the new approximation for $\phi(x)$ using the above values
for $\phi(2x)$, $\phi(2x - 1)$, $\phi(2x - 2)$, and $\phi(2x - 3)$ in the dilation equation for
$\phi(x)$.


end
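
A runnable variant of this successive-approximation loop is sketched below. It departs from the program above in two labeled ways: it starts from the Haar box function (an assumed initial guess), and it evaluates $\phi(2x - k)$ by exact index arithmetic on a grid with a power-of-two number of points per unit interval rather than by averaging neighboring samples. The function name and parameters are illustrative.

```python
import math


def daubechies_scale_samples(points_per_unit=16, iterations=80):
    """Approximate samples of phi at x = i / points_per_unit for 0 <= x <= 3."""
    s3 = math.sqrt(3.0)
    c = [(1 + s3) / 4, (3 + s3) / 4, (3 - s3) / 4, (1 - s3) / 4]
    m = points_per_unit          # grid points per unit interval (power of two)
    n = 3 * m                    # phi is supported on [0, 3]
    # Initial guess (an assumption): the box function, one on [0, 1), zero elsewhere.
    phi = [1.0 if i < m else 0.0 for i in range(n + 1)]
    for _ in range(iterations):
        new_phi = []
        for i in range(n + 1):
            total = 0.0
            for k, ck in enumerate(c):
                idx = 2 * i - k * m   # grid index of the point 2x - k
                if 0 <= idx <= n:
                    total += ck * phi[idx]
            new_phi.append(total)
        phi = new_phi
    return phi


samples = daubechies_scale_samples()
print(samples[16], samples[32])   # approximations to phi(1) and phi(2)
```

With this normalization the values at the integers settle at $\phi(1) = (1+\sqrt{3})/2$ and $\phi(2) = (1-\sqrt{3})/2$, and the samples sum to one when weighted by the grid spacing.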








Figure 11.3: Daubechies scale function and associated wavelet


The convergence of the iterative procedure for computing $\phi(x)$ is fast if the eigenvectors of
a certain matrix are unity.


Another approach to solving the dilation equation 


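Consider the dilation equation $\phi(x) = \frac{1}{2}\phi(2x) + \phi(2x - 1) + \frac{1}{2}\phi(2x - 2)$ and consider
continuous solutions with support in $0 \le x < 2$.

$\phi(0) = \frac{1}{2}\phi(0) + \phi(-1) + \frac{1}{2}\phi(-2) = \frac{1}{2}\phi(0) + 0 + 0$, so $\phi(0) = 0$
$\phi(2) = \frac{1}{2}\phi(4) + \phi(3) + \frac{1}{2}\phi(2) = 0 + 0 + \frac{1}{2}\phi(2)$, so $\phi(2) = 0$
$\phi(1) = \frac{1}{2}\phi(2) + \phi(1) + \frac{1}{2}\phi(0) = 0 + \phi(1) + 0$, so $\phi(1)$ is arbitrary

Taking $\phi(1) = 1$, the values at the remaining dyadic points follow from the dilation equation:

$\phi(\frac{1}{2}) = \frac{1}{2}\phi(1) + \phi(0) + \frac{1}{2}\phi(-1) = \frac{1}{2}$
$\phi(\frac{3}{2}) = \frac{1}{2}\phi(3) + \phi(2) + \frac{1}{2}\phi(1) = \frac{1}{2}$
$\phi(\frac{1}{4}) = \frac{1}{2}\phi(\frac{1}{2}) + \phi(-\frac{1}{2}) + \frac{1}{2}\phi(-\frac{3}{2}) = \frac{1}{4}$

One can continue this process and compute $\phi(\frac{i}{2^j})$ for larger values of $j$ until $\phi(x)$ is
approximated to a desired accuracy. If $\phi(x)$ is a simple equation as in this example, one
could conjecture its form and verify that the form satisfies the dilation equation.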
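
The dyadic-point computation is a short recursion. The sketch below assumes the normalization $\phi(1) = 1$ (the value at 1 is arbitrary, as noted above) and uses exact fractions; the function name is illustrative.

```python
from fractions import Fraction
from functools import lru_cache


@lru_cache(maxsize=None)
def phi(x):
    """phi at a dyadic rational x for phi(x) = phi(2x)/2 + phi(2x-1) + phi(2x-2)/2."""
    assert x.denominator & (x.denominator - 1) == 0, "x must be a dyadic rational"
    if x <= 0 or x >= 2:
        return Fraction(0)       # outside the support, or an endpoint
    if x.denominator == 1:
        return Fraction(1)       # x == 1: assumed normalization phi(1) = 1
    half = Fraction(1, 2)
    # Doubling x lowers its dyadic level by one, so the recursion terminates.
    return half * phi(2 * x) + phi(2 * x - 1) + half * phi(2 * x - 2)


values = {Fraction(i, 16): phi(Fraction(i, 16)) for i in range(33)}
print(values[Fraction(1, 2)], values[Fraction(3, 2)], values[Fraction(1, 4)])
```

At every sampled point the result agrees with the triangle $\min(x, 2 - x)$, which is the form one would conjecture and then verify against the dilation equation.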


11.5 Conditions on the Dilation Equation


We would like a basis for a vector space of functions where each basis vector has 
finite support and the basis vectors are orthogonal. This is achieved by a wavelet system 
consisting of a shifted version of a scale function that satisfies a dilation equation along 
with a set of wavelets of various scales and shifts. For the scale function to have a non- 
zero integral, Lemma 11.1 requires that the coefficients of the dilation equation sum to 
two. Although the scale function $\phi(x)$ for the Haar system has the property that $\phi(x)$
and $\phi(x - k)$, $k > 0$, are orthogonal, this is not true for the scale function for the dilation
equation $\phi(x) = \frac{1}{2}\phi(2x) + \phi(2x - 1) + \frac{1}{2}\phi(2x - 2)$. The conditions that integer shifts of the
scale function be orthogonal and that the scale function has finite support puts additional 
conditions on the coefficients of the dilation equation. These conditions are developed in 
the next two lemmas. 


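Lemma 11.2 Let

$$\phi(x) = \sum_{k=0}^{d-1} c_k \phi(2x - k).$$

If $\phi(x)$ and $\phi(x - k)$ are orthogonal for $k \ne 0$ and $\phi(x)$ has been normalized so that
$\int_{-\infty}^{\infty} \phi(x)\phi(x - k)\,dx = \delta(k)$, then $\sum_{i=0}^{d-1} c_i c_{i-2k} = 2\delta(k)$.


Proof: Assume $\phi(x)$ has been normalized so that $\int_{-\infty}^{\infty} \phi(x)\phi(x - k)\,dx = \delta(k)$. Then

$$\int_{-\infty}^{\infty} \phi(x)\phi(x - k)\,dx = \int_{-\infty}^{\infty} \sum_{i=0}^{d-1} c_i \phi(2x - i) \sum_{j=0}^{d-1} c_j \phi(2x - 2k - j)\,dx = \sum_{i=0}^{d-1} \sum_{j=0}^{d-1} c_i c_j \int_{-\infty}^{\infty} \phi(2x - i)\phi(2x - 2k - j)\,dx$$

Since

$$\int_{-\infty}^{\infty} \phi(2x - i)\phi(2x - 2k - j)\,dx = \frac{1}{2} \int_{-\infty}^{\infty} \phi(y - i)\phi(y - 2k - j)\,dy = \frac{1}{2} \int_{-\infty}^{\infty} \phi(y)\phi(y + i - 2k - j)\,dy = \frac{1}{2}\delta(2k + j - i),$$

$$\int_{-\infty}^{\infty} \phi(x)\phi(x - k)\,dx = \sum_{i=0}^{d-1} \sum_{j=0}^{d-1} c_i c_j \frac{1}{2}\delta(2k + j - i) = \frac{1}{2} \sum_{i=0}^{d-1} c_i c_{i-2k}.$$

Since $\phi(x)$ was normalized so that $\int_{-\infty}^{\infty} \phi(x)\phi(x - k)\,dx = \delta(k)$, it follows that
$\sum_{i=0}^{d-1} c_i c_{i-2k} = 2\delta(k)$. ∎
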
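Scale and wavelet coefficients equations

Scale function:

$\phi(x) = \sum_{k=0}^{d-1} c_k \phi(2x - k)$
$\int_{-\infty}^{\infty} \phi(x)\phi(x - k)\,dx = \delta(k)$
$\sum_{j=0}^{d-1} c_j = 2$
$\sum_{j=0}^{d-1} c_j c_{j-2k} = 2\delta(k)$
$c_k = 0$ unless $0 \le k \le d - 1$
$d$ even
$\sum_{j=0}^{d-1} c_{2j} = \sum_{j=0}^{d-1} c_{2j+1}$

Wavelet:

$\psi(x) = \sum_{k=0}^{d-1} b_k \phi(2x - k)$
$\int_{-\infty}^{\infty} \phi(x)\psi(x - k)\,dx = 0$
$\int_{-\infty}^{\infty} \psi(x)\,dx = 0$
$\int_{-\infty}^{\infty} \psi(x)\psi(x - k)\,dx = \delta(k)$
$\sum_{i=0}^{d-1} (-1)^k b_i b_{i-2k} = 2\delta(k)$
$\sum_{j=0}^{d-1} c_j b_{j-2k} = 0$
$\sum_{j=0}^{d-1} b_j = 0$
$b_k = (-1)^k c_{d-1-k}$


One designs wavelet systems so the above conditions are satisfied.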











Lemma 11.2 provides a necessary but not sufficient condition on the coefficients of 
the dilation equation for shifts of the scale function to be orthogonal. One should note 
that the conditions of Lemma 11.2 are not true for the triangular or piecewise quadratic 
solutions to 


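$$\phi(x) = \frac{1}{2}\phi(2x) + \phi(2x - 1) + \frac{1}{2}\phi(2x - 2)$$
and
$$\phi(x) = \frac{1}{4}\phi(2x) + \frac{3}{4}\phi(2x - 1) + \frac{3}{4}\phi(2x - 2) + \frac{1}{4}\phi(2x - 3)$$
which overlap and are not orthogonal.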
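
The condition of Lemma 11.2 is easy to test numerically for a given coefficient list. In the sketch below the Haar and Daubechies coefficients are those used in this chapter; the coefficients $(\frac{1}{4}, \frac{3}{4}, \frac{3}{4}, \frac{1}{4})$ for the piecewise quadratic case are an assumption (they are the standard quadratic B-spline refinement coefficients). The helper names are illustrative.

```python
import math


def shifted_product_sum(c, k):
    """Compute sum_i c_i * c_{i-2k}, treating out-of-range coefficients as zero."""
    d = len(c)
    return sum(c[i] * c[i - 2 * k] for i in range(d) if 0 <= i - 2 * k < d)


def satisfies_lemma_11_2(c, tol=1e-12):
    """Check sum_i c_i c_{i-2k} = 2*delta(k) for every shift k that can be non-zero."""
    d = len(c)
    return all(abs(shifted_product_sum(c, k) - (2.0 if k == 0 else 0.0)) <= tol
               for k in range(-(d // 2), d // 2 + 1))


s3 = math.sqrt(3.0)
systems = {
    "haar": [1.0, 1.0],
    "daubechies": [(1 + s3) / 4, (3 + s3) / 4, (3 - s3) / 4, (1 - s3) / 4],
    "triangular": [0.5, 1.0, 0.5],
    "quadratic": [0.25, 0.75, 0.75, 0.25],   # assumed coefficients, see lead-in
}
results = {name: satisfies_lemma_11_2(c) for name, c in systems.items()}
print(results)
```

All four coefficient lists sum to two, so Lemma 11.1 does not distinguish them; only the quadratic condition of Lemma 11.2 separates the two orthogonal systems from the two overlapping ones.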


For $\phi(x)$ to have finite support the dilation equation can have only a finite number of
terms. This is proved in the following lemma.


Lemma 11.3 If $0 \le x < d$ is the support of $\phi(x)$, and the set of integer shifts,
$\{\phi(x - k) \mid k \ge 0\}$, are linearly independent, then $c_k = 0$ unless $0 \le k \le d - 1$.

Proof: If the support of $\phi(x)$ is $0 \le x < d$, then the support of $\phi(2x)$ is $0 \le x < \frac{d}{2}$. If

$$\phi(x) = \sum_{k=-\infty}^{\infty} c_k \phi(2x - k)$$

the support of both sides of the equation must be the same. Since the $\phi(x - k)$ are linearly
independent the limits of the summation are actually $k = 0$ to $d - 1$ and

$$\phi(x) = \sum_{k=0}^{d-1} c_k \phi(2x - k).$$

It follows that $c_k = 0$ unless $0 \le k \le d - 1$.


The condition that the integer shifts are linearly independent is essential to the proof
and the lemma is not true without this condition. ∎


One should also note that $\sum_{i=0}^{d-1} c_i c_{i-2k} = 0$ for $k \ne 0$ implies that $d$ is even since for $d$ odd
and $k = \frac{d-1}{2}$

$$\sum_{i=0}^{d-1} c_i c_{i-2k} = \sum_{i=0}^{d-1} c_i c_{i-d+1} = c_{d-1} c_0.$$

For $c_{d-1} c_0$ to be zero either $c_{d-1}$ or $c_0$ must be zero. Since either $c_0 = 0$ or $c_{d-1} = 0$,
there are only $d - 1$ non-zero coefficients. From here on we assume that $d$ is even. If the
dilation equation has $d$ terms and the coefficients satisfy the linear equation $\sum_{k=0}^{d-1} c_k = 2$
and the $\frac{d}{2}$ quadratic equations $\sum_{i=0}^{d-1} c_i c_{i-2k} = 2\delta(k)$ for $1 \le k \le \frac{d-1}{2}$, then for $d > 2$ there
are $\frac{d}{2} - 1$ coefficients that can be used to design the wavelet system to achieve desired
properties.


11.6 Derivation of the Wavelets from the Scaling Function 


In a wavelet system one develops a mother wavelet as a linear combination of integer
shifts of a scaled version of the scale function $\phi(x)$. Let the mother wavelet $\psi(x)$ be given
by $\psi(x) = \sum_{k=0}^{d-1} b_k \phi(2x - k)$. One wants integer shifts of the mother wavelet $\psi(x - k)$ to
be orthogonal and also for integer shifts of the mother wavelet to be orthogonal to the
scaling function $\phi(x)$. These conditions place restrictions on the coefficients $b_k$ which are
the subject matter of the next two lemmas.


Lemma 11.4 (Orthogonality of $\psi(x)$ and $\psi(x - k)$) Let $\psi(x) = \sum_{k=0}^{d-1} b_k \phi(2x - k)$. If $\psi(x)$
and $\psi(x - k)$ are orthogonal for $k \ne 0$ and $\psi(x)$ has been normalized so that
$\int_{-\infty}^{\infty} \psi(x)\psi(x - k)\,dx = \delta(k)$, then

$$\sum_{i=0}^{d-1} (-1)^k b_i b_{i-2k} = 2\delta(k).$$


Proof: Analogous to Lemma 11.2. ∎
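
Lemma 11.5 (Orthogonality of $\phi(x)$ and $\psi(x - k)$) Let $\phi(x) = \sum_{k=0}^{d-1} c_k \phi(2x - k)$ and
$\psi(x) = \sum_{k=0}^{d-1} b_k \phi(2x - k)$. If $\int_{x=-\infty}^{\infty} \phi(x)\phi(x - k)\,dx = \delta(k)$ and $\int_{x=-\infty}^{\infty} \phi(x)\psi(x - k)\,dx = 0$ for
all $k$, then $\sum_{i=0}^{d-1} c_i b_{i-2k} = 0$ for all $k$.


Proof:

$$\int_{x=-\infty}^{\infty} \phi(x)\psi(x - k)\,dx = \int_{x=-\infty}^{\infty} \sum_{i=0}^{d-1} c_i \phi(2x - i) \sum_{j=0}^{d-1} b_j \phi(2x - 2k - j)\,dx = 0.$$

Interchanging the order of integration and summation

$$\sum_{i=0}^{d-1} \sum_{j=0}^{d-1} c_i b_j \int_{x=-\infty}^{\infty} \phi(2x - i)\phi(2x - 2k - j)\,dx = 0$$

Substituting $y = 2x - i$ yields

$$\frac{1}{2} \sum_{i=0}^{d-1} \sum_{j=0}^{d-1} c_i b_j \int_{y=-\infty}^{\infty} \phi(y)\phi(y - 2k - j + i)\,dy = 0$$

Thus,

$$\sum_{i=0}^{d-1} \sum_{j=0}^{d-1} c_i b_j \delta(2k + j - i) = 0$$

Summing over $j$ gives

$$\sum_{i=0}^{d-1} c_i b_{i-2k} = 0$$
∎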


Lemma 11.5 gave a condition on the coefficients in the equations for $\phi(x)$ and $\psi(x)$ if
integer shifts of the mother wavelet are to be orthogonal to the scale function. In addition,
for integer shifts of the mother wavelet to be orthogonal to the scale function requires
that $b_k = (-1)^k c_{d-1-k}$.


Lemma 11.6 Let the scale function $\phi(x)$ equal $\sum_{k=0}^{d-1} c_k \phi(2x - k)$ and let the wavelet function
$\psi(x)$ equal $\sum_{k=0}^{d-1} b_k \phi(2x - k)$. If the scale functions are orthogonal

$$\int_{-\infty}^{\infty} \phi(x)\phi(x - k)\,dx = \delta(k)$$

and the wavelet functions are orthogonal with the scale function

$$\int_{x=-\infty}^{\infty} \phi(x)\psi(x - k)\,dx = 0$$

for all $k$, then $b_k = (-1)^k c_{d-1-k}$.


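Proof: By Lemma 11.5, $\sum_{j=0}^{d-1} c_j b_{j-2k} = 0$ for all $k$. Separating $\sum_{j=0}^{d-1} c_j b_{j-2k} = 0$ into odd and
even indices gives

$$\sum_{j=0}^{\frac{d}{2}-1} c_{2j} b_{2j-2k} + \sum_{j=0}^{\frac{d}{2}-1} c_{2j+1} b_{2j+1-2k} = 0 \qquad (11.1)$$

for all $k$.

$c_0 b_0 + c_2 b_2 + c_4 b_4 + \cdots + c_1 b_1 + c_3 b_3 + c_5 b_5 + \cdots = 0$   ($k = 0$)
$c_2 b_0 + c_4 b_2 + \cdots + c_3 b_1 + c_5 b_3 + \cdots = 0$   ($k = 1$)
$c_4 b_0 + \cdots + c_5 b_1 + \cdots = 0$   ($k = 2$)

By Lemmas 11.2 and 11.4, $\sum_{j=0}^{d-1} c_j c_{j-2k} = 2\delta(k)$ and $\sum_{j=0}^{d-1} b_j b_{j-2k} = 2\delta(k)$ for all $k$.
Separating odd and even terms,

$$\sum_{j=0}^{\frac{d}{2}-1} c_{2j} c_{2j-2k} + \sum_{j=0}^{\frac{d}{2}-1} c_{2j+1} c_{2j+1-2k} = 2\delta(k) \qquad (11.2)$$

and

$$\sum_{j=0}^{\frac{d}{2}-1} b_{2j} b_{2j-2k} + \sum_{j=0}^{\frac{d}{2}-1} (-1)^j b_{2j+1} b_{2j+1-2k} = 2\delta(k) \qquad (11.3)$$

for all $k$.

$c_0 c_0 + c_2 c_2 + c_4 c_4 + \cdots + c_1 c_1 + c_3 c_3 + c_5 c_5 + \cdots = 2$   ($k = 0$)
$c_2 c_0 + c_4 c_2 + \cdots + c_3 c_1 + c_5 c_3 + \cdots = 0$   ($k = 1$)
$c_4 c_0 + \cdots + c_5 c_1 + \cdots = 0$   ($k = 2$)

$b_0 b_0 + b_2 b_2 + b_4 b_4 + \cdots + b_1 b_1 - b_3 b_3 + b_5 b_5 - \cdots = 2$   ($k = 0$)
$b_2 b_0 + b_4 b_2 + \cdots - b_3 b_1 + b_5 b_3 - \cdots = 0$   ($k = 1$)
$b_4 b_0 + \cdots + b_5 b_1 - \cdots = 0$   ($k = 2$)
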
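Let $C_e = (c_0, c_2, \ldots, c_{d-2})$, $C_o = (c_1, c_3, \ldots, c_{d-1})$, $B_e = (b_0, b_2, \ldots, b_{d-2})$, and $B_o =
(b_1, b_3, \ldots, b_{d-1})$. Equations 11.1, 11.2, and 11.3 can be expressed as convolutions$^{46}$ of
these sequences. Equation 11.1 is $C_e * B_e^R + C_o * B_o^R = 0$, 11.2 is $C_e * C_e^R + C_o * C_o^R = 2\delta(k)$,
and 11.3 is $B_e * B_e^R + B_o * B_o^R = 2\delta(k)$, where the superscript $R$ stands for reversal of the
sequence. These equations can be written in matrix format as

$$\begin{pmatrix} C_e & C_o \\ B_e & B_o \end{pmatrix} * \begin{pmatrix} C_e^R & B_e^R \\ C_o^R & B_o^R \end{pmatrix} = \begin{pmatrix} 2\delta & 0 \\ 0 & 2\delta \end{pmatrix}$$

Taking the Fourier or z-transform yields

$$\begin{pmatrix} F(C_e) & F(C_o) \\ F(B_e) & F(B_o) \end{pmatrix} \begin{pmatrix} F(C_e^R) & F(B_e^R) \\ F(C_o^R) & F(B_o^R) \end{pmatrix} = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$$

where $F$ denotes the transform. Taking the determinant yields

$$\left( F(C_e)F(B_o) - F(B_e)F(C_o) \right) \left( F(C_e)F(B_o) - F(C_o)F(B_e) \right) = 4$$

Thus $F(C_e)F(B_o) - F(C_o)F(B_e) = 2$ and the inverse transform yields

$$C_e * B_o - C_o * B_e = 2\delta(k).$$

Convolution by $C_e^R$ yields

$$C_e^R * C_e * B_o - C_e^R * B_e * C_o = C_e^R * 2\delta(k)$$

Now $\sum_{j=0}^{d-1} c_j b_{j-2k} = 0$ so $-C_e^R * B_e = C_o^R * B_o$. Thus

$$C_e^R * C_e * B_o + C_o^R * B_o * C_o = 2C_e^R * \delta(k)$$
$$(C_e^R * C_e + C_o^R * C_o) * B_o = 2C_e^R * \delta(k)$$
$$2\delta(k) * B_o = 2C_e^R * \delta(k)$$
$$C_e = B_o^R$$

Thus, $c_i = b_{d-1-i}$ for even $i$. By a similar argument, convolution by $C_o^R$ yields

$$C_o^R * C_e * B_o - C_o^R * C_o * B_e = 2C_o^R * \delta(k)$$

Since $C_o^R * B_o = -C_e^R * B_e$

$$-C_e^R * C_e * B_e - C_o^R * C_o * B_e = 2C_o^R * \delta(k)$$
$$-(C_e * C_e^R + C_o^R * C_o) * B_e = 2C_o^R * \delta(k)$$
$$-2\delta(k) * B_e = 2C_o^R * \delta(k)$$
$$-B_e = C_o^R$$

Thus, $c_i = -b_{d-1-i}$ for all odd $i$ and hence $c_i = (-1)^i b_{d-1-i}$ for all $i$. ∎

$^{46}$ The convolution of $(a_0, a_1, \ldots, a_{d-1})$ and $(b_0, b_1, \ldots, b_{d-1})$ denoted
$(a_0, a_1, \ldots, a_{d-1}) * (b_0, b_1, \ldots, b_{d-1})$ is the sequence
$(a_0 b_{d-1},\ a_0 b_{d-2} + a_1 b_{d-1},\ a_0 b_{d-3} + a_1 b_{d-2} + a_2 b_{d-1},\ \ldots,\ a_{d-1} b_0)$.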

11.7 Sufficient Conditions for the Wavelets to be Orthogonal


Section 11.6 gave necessary conditions on the $b_k$ and $c_k$ in the definitions of the scale
function and wavelets for certain orthogonality properties. In this section we show that
these conditions are also sufficient for certain orthogonality conditions. One would like a
wavelet system to satisfy certain conditions.


1. Wavelets, $\psi_j(2^j x - k)$, at all scales and shifts to be orthogonal to the scale function
$\phi(x)$.

2. All wavelets to be orthogonal. That is

$$\int_{-\infty}^{\infty} \psi_j(2^j x - k)\psi_l(2^l x - m)\,dx = \delta(j - l)\delta(k - m)$$

3. $\phi(x)$ and $\psi_{jk}$, $j \le l$ and all $k$, to span $V_l$, the space spanned by $\phi(2^l x - k)$ for all $k$.


These items are proved in the following lemmas. The first lemma gives sufficient conditions
on the wavelet coefficients $b_k$ in the definition

$$\psi(x) = \sum_k b_k \phi(2x - k)$$

for the mother wavelet so that the wavelets will be orthogonal to the scale function. The
lemma shows that if the wavelet coefficients equal the scale coefficients in reverse order
with alternating negative signs, then the wavelets will be orthogonal to the scale function.


Lemma 11.7 If bp = (—1)"ca-1-p, then [%_lajy(2x — dx = 0 for all j and 1. 


Proof: Assume that by = (—1)*cg_1_x. We first show that f(x) and y(x — k) are orthog- 
onal for all values of k. Then we modify the proof to show that ọ(x) and w(2/x — k) are 
orthogonal for all 7 and k. 


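Assume $b_k = (-1)^k c_{d-1-k}$. Since the integer shifts of $\phi$ are orthonormal,
$\int \phi(2x - i)\,\phi(2x - 2k - j)\,dx = \tfrac{1}{2}\,\delta(i - 2k - j)$. Then

$$
\begin{aligned}
\int_{-\infty}^{\infty} \phi(x)\,\psi(x - k)\,dx
 &= \int_{-\infty}^{\infty} \sum_{i=0}^{d-1} c_i\,\phi(2x - i) \sum_{j=0}^{d-1} b_j\,\phi(2x - 2k - j)\,dx \\
 &= \sum_{i=0}^{d-1} \sum_{j=0}^{d-1} c_i\,(-1)^j c_{d-1-j} \int_{-\infty}^{\infty} \phi(2x - i)\,\phi(2x - 2k - j)\,dx \\
 &= \tfrac{1}{2} \sum_{i=0}^{d-1} \sum_{j=0}^{d-1} (-1)^j c_i\,c_{d-1-j}\,\delta(i - 2k - j) \\
 &= \tfrac{1}{2} \sum_{j=0}^{d-1} (-1)^j c_{2k+j}\,c_{d-1-j} \\
 &= \tfrac{1}{2} \left( c_{2k}c_{d-1} - c_{2k+1}c_{d-2} + \cdots + c_{d-2}c_{2k+1} - c_{d-1}c_{2k} \right) \\
 &= 0.
\end{aligned}
$$

The last step requires that $d$ be even, which we have assumed for all scale functions; the
terms then cancel in pairs.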


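For the case where the wavelet is $\psi(2^j x - l)$, first express $\phi(x)$ as a linear combination
of $\phi(2^{j-1}x - n)$. Now for each of these terms

$$\int_{-\infty}^{\infty} \phi(2^{j-1}x - m)\,\psi(2^j x - k)\,dx = 0.$$

To see this, substitute $y = 2^{j-1}x$. Then

$$\int_{-\infty}^{\infty} \phi(2^{j-1}x - m)\,\psi(2^j x - k)\,dx = \frac{1}{2^{j-1}} \int_{-\infty}^{\infty} \phi(y - m)\,\psi(2y - k)\,dy$$

which by the previous argument is zero. ∎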


The next lemma gives conditions on the coefficients $b_k$ that are sufficient for the
wavelets to be orthogonal.


Lemma 11.8 If $b_k = (-1)^k c_{d-1-k}$, then

$$\int_{-\infty}^{\infty} \frac{1}{2^j}\,\psi_j(2^j x - k)\,\frac{1}{2^l}\,\psi_l(2^l x - m)\,dx = \delta(j - l)\,\delta(k - m).$$


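Proof: The first level wavelets are orthogonal.

$$
\begin{aligned}
\int_{-\infty}^{\infty} \psi(x)\,\psi(x - k)\,dx
 &= \int_{-\infty}^{\infty} \sum_{i=0}^{d-1} b_i\,\phi(2x - i) \sum_{j=0}^{d-1} b_j\,\phi(2x - 2k - j)\,dx \\
 &= \sum_{i=0}^{d-1} \sum_{j=0}^{d-1} b_i b_j \int_{-\infty}^{\infty} \phi(2x - i)\,\phi(2x - 2k - j)\,dx \\
 &= \tfrac{1}{2} \sum_{i=0}^{d-1} \sum_{j=0}^{d-1} b_i b_j\,\delta(i - 2k - j) \\
 &= \tfrac{1}{2} \sum_{i=0}^{d-1} b_i b_{i-2k} \\
 &= \tfrac{1}{2} \sum_{i=0}^{d-1} (-1)^i c_{d-1-i}\,(-1)^{i-2k} c_{d-1-i+2k} \\
 &= \tfrac{1}{2} \sum_{i=0}^{d-1} c_{d-1-i}\,c_{d-1-i+2k}.
\end{aligned}
$$

Substituting $j$ for $d - 1 - i$ yields

$$\tfrac{1}{2} \sum_{j=0}^{d-1} c_j c_{j+2k} = \tfrac{1}{2} \cdot 2\delta(k) = \delta(k).$$

Example of orthogonality when wavelets are of different scale.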


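$$
\begin{aligned}
\int_{-\infty}^{\infty} \psi(2x)\,\psi(x - k)\,dx
 &= \int_{-\infty}^{\infty} \sum_{i=0}^{d-1} b_i\,\phi(4x - i) \sum_{j=0}^{d-1} b_j\,\phi(2x - 2k - j)\,dx \\
 &= \sum_{i=0}^{d-1} \sum_{j=0}^{d-1} b_i b_j \int_{-\infty}^{\infty} \phi(4x - i)\,\phi(2x - 2k - j)\,dx.
\end{aligned}
$$

Since $\phi(2x - 2k - j) = \sum_{l=0}^{d-1} c_l\,\phi(4x - 4k - 2j - l)$,

$$
\begin{aligned}
\int_{-\infty}^{\infty} \psi(2x)\,\psi(x - k)\,dx
 &= \sum_{i=0}^{d-1} \sum_{j=0}^{d-1} \sum_{l=0}^{d-1} b_i b_j c_l \int_{-\infty}^{\infty} \phi(4x - i)\,\phi(4x - 4k - 2j - l)\,dx \\
 &= \tfrac{1}{4} \sum_{i=0}^{d-1} \sum_{j=0}^{d-1} \sum_{l=0}^{d-1} b_i b_j c_l\,\delta(i - 4k - 2j - l) \\
 &= \tfrac{1}{4} \sum_{j=0}^{d-1} b_j \sum_{i=0}^{d-1} b_i\,c_{i-4k-2j}.
\end{aligned}
$$

Since $\sum_{j} c_j b_{j-2k} = 0$ for every integer $k$, the inner sum $\sum_{i} b_i c_{i-4k-2j}$ is zero for
every $j$. Thus

$$\int_{-\infty}^{\infty} \psi(2x)\,\psi(x - k)\,dx = 0.$$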


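Orthogonality of scale function with wavelet of different scale.

$$
\begin{aligned}
\int_{-\infty}^{\infty} \phi(x)\,\psi(2x - k)\,dx
 &= \int_{-\infty}^{\infty} \sum_{j=0}^{d-1} c_j\,\phi(2x - j)\,\psi(2x - k)\,dx \\
 &= \sum_{j=0}^{d-1} c_j \int_{-\infty}^{\infty} \phi(2x - j)\,\psi(2x - k)\,dx \\
 &= \tfrac{1}{2} \sum_{j=0}^{d-1} c_j \int_{-\infty}^{\infty} \phi(y - j)\,\psi(y - k)\,dy \\
 &= 0.
\end{aligned}
$$

If $\psi$ was of scale $2^j$, $\phi$ would be expanded as a linear combination of $\phi$ of scale $2^j$, all of
which would be orthogonal to $\psi$. ∎

11.8 Expressing a Function in Terms of Wavelets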


Given a wavelet system with scale function $\phi$ and mother wavelet $\psi$ we wish to express
a function $f(x)$ in terms of an orthonormal basis of the wavelet system. First we will
express $f(x)$ in terms of scale functions $\phi_{jk}(x) = \phi(2^j x - k)$. To do this we will build a tree
similar to that in Figure 11.2 for the Haar system, except that computing the coefficients
will be much more complex. Recall that the coefficients at a level in the tree are the
coefficients to represent $f(x)$ using scale functions with the precision of the level.


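Let $f(x) = \sum_{k=0}^{\infty} a_{jk}\,\phi_j(x - k)$ where the $a_{jk}$ are the coefficients in the expansion of
$f(x)$ using level $j$ scale functions. Since the $\phi_j(x - k)$ are orthogonal

$$a_{jk} = \int_{x=-\infty}^{\infty} f(x)\,\phi_j(x - k)\,dx.$$

Expanding $\phi_j$ in terms of $\phi_{j+1}$ yields

$$
\begin{aligned}
a_{jk} &= \int_{x=-\infty}^{\infty} f(x) \sum_{m=0}^{d-1} c_m\,\phi_{j+1}(2x - 2k - m)\,dx \\
 &= \sum_{m=0}^{d-1} c_m \int_{x=-\infty}^{\infty} f(x)\,\phi_{j+1}(2x - 2k - m)\,dx \\
 &= \sum_{m=0}^{d-1} c_m\,a_{j+1,\,2k+m}.
\end{aligned}
$$

Let $n = 2k + m$. Now $m = n - 2k$. Then

$$a_{jk} = \sum_{n=2k}^{2k+d-1} c_{n-2k}\,a_{j+1,\,n} \qquad (11.4)$$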


In constructing a tree similar to that in Figure 11.2, the values at the leaves are
the values of the function sampled in the intervals of size $2^{-j}$. Equation 11.4 is used to
compute values as one moves up the tree. The coefficients in the tree could be used if we 
wanted to represent f(x) using scale functions. However, we want to represent f(x) using 
one scale function whose scale is the support of f(x) along with wavelets which gives us 
an orthogonal set of basis functions. To do this we need to calculate the coefficients for 
the wavelets. The value at the root of the tree is the coefficient for the scale function. We 
then move down the tree calculating the coefficients for the wavelets. 
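
One step of moving up the tree with Equation 11.4 can be sketched as follows. This is only an illustration under stated assumptions: the function name `coarsen`, the representation of a tree level as a Python list, and the convention that coefficients past the end of the list count as zero are choices made for this example, not part of the text.

```python
def coarsen(a_next, c):
    """One step up the tree: a[j][k] = sum_m c[m] * a[j+1][2k+m] (Equation 11.4).

    a_next -- level j+1 coefficients, as a list (assumed representation)
    c      -- dilation-equation coefficients c_0, ..., c_{d-1}
    Entries of a_next past the end are treated as zero (assumed boundary rule).
    """
    n = len(a_next)
    level = []
    for k in range((n + 1) // 2):
        total = 0.0
        for m, cm in enumerate(c):
            idx = 2 * k + m
            if idx < n:
                total += cm * a_next[idx]
        level.append(total)
    return level


# With the Haar coefficients c_0 = c_1 = 1, each parent is the sum of its two children.
haar_parents = coarsen([1.0, 2.0, 3.0, 4.0], [1.0, 1.0])
```

Applying `coarsen` repeatedly until one value remains gives the root coefficient described above; longer coefficient lists such as those of D4 make neighbouring parents overlap in the children they use.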


11.9 Designing a Wavelet System 


In designing a wavelet system there are a number of parameters in the dilation equation.
If one uses $d$ terms in the dilation equation, one degree of freedom can be used to
satisfy

$$\sum_{i=0}^{d-1} c_i = 2$$

which ensures the existence of a solution with a non-zero mean. Another $\frac{d}{2}$ degrees of
freedom are used to satisfy

$$\sum_{i=0}^{d-1} c_i c_{i-2k} = 2\delta(k)$$

which ensures the orthogonal properties. The remaining $\frac{d}{2} - 1$ degrees of freedom can be
used to obtain some desirable properties such as smoothness. Smoothness appears to be
related to vanishing moments of the scaling function. Material on the design of systems
is beyond the scope of this book and can be found in the literature.


11.10 Applications 


Wavelets are widely used in data compression for images and speech, as well as in 
computer vision for representing images. Unlike the sines and cosines of the Fourier 
transform, wavelets have spatial locality in addition to frequency information, which can 
be useful for better understanding the contents of an image and for relating pieces of 
different images to each other. Wavelets are also being used in power line communication 
protocols that send data over highly noisy channels. 


11.11 Bibliographic Notes 


In 1909 Alfred Haar presented an orthonormal basis for functions with finite support.
Ingrid Daubechies [Dau90] generalized Haar’s work and created the area of wavelets. There
are many references on wavelets. Several that may be useful are Strang and Nguyen,
Wavelets and Filter Banks [SN97] and Burrus, Gopinath, and Guo, Introduction to Wavelets
and Wavelet Transforms: A Primer [BGG97].

11.12 Exercises


Exercise 11.1 Give a solution to the dilation equation $f(x) = f(2x) + f(2x - k)$ satisfying
$f(0) = 1$. Assume $k$ is an integer.


Exercise 11.2 Are there solutions to $f(x) = f(2x) + f(2x - 1)$ other than a constant
multiple of

$$f(x) = \begin{cases} 1 & 0 \le x < 1 \\ 0 & \text{otherwise} \end{cases} \; ?$$


Exercise 11.3 Is there a solution to $f(x) = \frac{1}{2} f(2x) + f(2x - 1) + \frac{1}{2} f(2x - 2)$ with
$f(0) = f(1) = 1$ and $f(2) = 0$?


Exercise 11.4 What is the solution to the dilation equation

$$f(x) = f(2x) + f(2x - 1) + f(2x - 2) + f(2x - 3)?$$

Exercise 11.5 Consider the dilation equation

$$f(x) = f(2x) + 2f(2x - 1) + 2f(2x - 2) + 2f(2x - 3) + f(2x - 4).$$

1. What is the solution to the dilation equation?
2. What is the value of $\int_{-\infty}^{\infty} f(x)\,dx$?

Exercise 11.6 What are the solutions to the following families of dilation equations. 

















1. 
f(x) =f (2x) + f(Qx — 1) 
1 1 1 1 
f(x) =f (22) + zf (2a a eg zf (2a — 2) + 5f (2a — 3) 
1 1 1 1 1 1 
f(x) =7f (22) + gfe 15 qf 22 -= 2) + gfe = 3) + gfe —4)+ gfe — 5) 
1 1 
+ qf (2a — 6) + ¿Hz — 7) 
fe) =,1(05) + $00) +--+ 102) 
2. 
He) =5f (22) + fr- 1) + fs 2) +5 f(s- 3) 
1 3 3 1 
f(x) =f (22) + a es coe qv 2) ¿Hz 8) 
1 1 
f(x) =z f (2x) + a = 11 a = 2) + ¿He — 3) 
1 — 1 — 1 1 
f(x) = (22) F Fe Syst Fae — 2) + gf (2x — 3) 





F(x) =51(02) + ¿1(27—1) + 52x 2) + fn 3) 
f(a) 3103) - 1011) + 5/(05-2) 31073) 
F(t) =100) - 5 f(20 — 1) +00 2) — 5 fx 3) 
10) 100) - > 500 - 1) + Speen -2) - peer -3) 
a 
f(x) nsf + Sf (20-1) + Sf (20-2) + gis -3) 
He) =5f(2e) ~ flr- 1) + Sfr- 2) -Ef 3) 
fle) =$ (20) - Ff (2x 1) + 51202) — Š fr 3) 
Fe) =H pan) - pr) + 2 pen 2) - pee 3) 


Exercise 11.7 


1. What is the solution to the dilation equation $f(x) = \frac{1}{2} f(2x) + \frac{3}{2} f(2x - 1)$? Hint:
Write a program to see what the solution looks like.


2. How does the solution change when the equation is changed to $f(x) = \frac{1}{3} f(2x) + \frac{5}{3} f(2x - 1)$?


3. How does the solution change if the coefficients no longer sum to two as in
$f(x) = f(2x) + 3f(2x - 1)$?
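
For the programming hint in Exercise 11.7, one possible starting point is to iterate the dilation equation on a grid. The sketch below is an assumption-laden illustration rather than a prescribed method: the name `cascade`, the parameters `acc` and `iters`, the starting guess (the indicator of $[0, 1)$), and the point-sampling rule are all choices made for this example.

```python
def cascade(c, acc=32, iters=30):
    """Iterate phi(x) <- sum_k c[k] * phi(2x - k) on a grid of `acc` samples per unit.

    The grid covers [0, d-1] for d coefficients, which contains the support
    of the solution. Starts from the indicator function of [0, 1).
    """
    d = len(c)
    n = max(1, d - 1) * acc + 1
    phi = [1.0 if i < acc else 0.0 for i in range(n)]
    for _ in range(iters):
        new = [0.0] * n
        for i in range(n):
            for k, ck in enumerate(c):
                src = 2 * i - k * acc  # grid index of the point 2x - k
                if 0 <= src < n:
                    new[i] += ck * phi[src]
        phi = new
    return phi


# Coefficients 1/2, 1, 1/2 (as in Exercise 11.3): the iterates approach a "hat" shape.
hat = cascade([0.5, 1.0, 0.5])
```

Passing the coefficient lists from the three parts of the exercise and plotting the returned samples shows how the iterates behave, including what happens when the coefficients do not sum to two.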


Exercise 11.8 If $f(x)$ is frequency limited by $2\pi$, prove that


E ie — k)) 
a 


Hint: Use the Nyquist sampling theorem which states that a function frequency limited by
$2\pi$ is completely determined by samples spaced one unit apart. Note that this result means


that = OE) 
- J fe) ae 


Exercise 11.9 Compute an approximation to the scaling function that comes from the
dilation equation

$$\phi(x) = \frac{1+\sqrt{3}}{4}\,\phi(2x) + \frac{3+\sqrt{3}}{4}\,\phi(2x - 1) + \frac{3-\sqrt{3}}{4}\,\phi(2x - 2) + \frac{1-\sqrt{3}}{4}\,\phi(2x - 3).$$


Exercise 11.10 Consider $f(x)$ to consist of the semi circle $(x - \frac{1}{2})^2 + y^2 = \frac{1}{4}$ and $y \ge 0$
for $0 \le x \le 1$ and 0 otherwise.


1. Using precision $j = 4$ find the coefficients for the scale functions and the wavelets
for D4 defined by the dilation equation

$$\phi(x) = \frac{1+\sqrt{3}}{4}\,\phi(2x) + \frac{3+\sqrt{3}}{4}\,\phi(2x - 1) + \frac{3-\sqrt{3}}{4}\,\phi(2x - 2) + \frac{1-\sqrt{3}}{4}\,\phi(2x - 3).$$


2. Graph the approximation to the semi circle for precision $j = 4$.


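Exercise 11.11 What is the set of all solutions to the dilation equation

$$\phi(x) = \frac{1+\sqrt{3}}{4}\,\phi(2x) + \frac{3+\sqrt{3}}{4}\,\phi(2x - 1) + \frac{3-\sqrt{3}}{4}\,\phi(2x - 2) + \frac{1-\sqrt{3}}{4}\,\phi(2x - 3)?$$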











Exercise 11.12 Prove that if scale functions defined by a dilation equation are orthogonal,
then the sum of the even coefficients must equal the sum of the odd coefficients in the
dilation equation. That is, $\sum_k c_{2k} = \sum_k c_{2k+1}$.


function wavelets

% Approximate the D4 scaling function by iterating the dilation equation.
acc=32; % accuracy of computation
phit=[1:acc zeros(1,3*acc)]; % initial guess


c1=(1+3^0.5)/4; c2=(3+3^0.5)/4; c3=(3-3^0.5)/4; c4=(1-3^0.5)/4;
for i=1:10
    % phi(2x): compress by averaging adjacent samples
    temp=(phit(1:2:4*acc)+phit(2:2:4*acc))/2;
    phi2t=[temp zeros(1,3*acc)];
    phi2tshift1=[zeros(1,acc) temp zeros(1,2*acc)];
    phi2tshift2=[zeros(1,2*acc) temp zeros(1,acc)];
    phi2tshift3=[zeros(1,3*acc) temp];
    phit=c1*phi2t+c2*phi2tshift1+c3*phi2tshift2+c4*phi2tshift3;
    plot(phit)
    figure(gcf)
    pause % wait for a key press before the next iteration
end
plot(phit)
figure(gcf)
end

12 Appendix


12.1 Definitions and Notation
12.1.1 Integers


non-negative integers: 0, 1, 2, 3, ...
positive integers: 1, 2, 3, ...
integers: ..., −3, −2, −1, 0, 1, 2, 3, ...


12.1.2 Substructures 


A substring of a string is a contiguous string of symbols from the original string. A 
subsequence of a sequence is a sequence of elements from the original sequence in order 
but not necessarily contiguous. With subgraphs there are two possible definitions. We 
define a subgraph of a graph to be a subset of the vertices and a subset of the edges of 
the graph induced by the vertices. An induced subgraph is a subset of the vertices and 
all the edges of the graph induced by the subset of vertices. 


12.1.3 Asymptotic Notation 


We introduce the big O notation here. A motivating example is analyzing the running
time of an algorithm. The running time may be a complicated function of the input length
$n$ such as $5n^3 + 25n^2 \ln n - 6n + 22$. Asymptotic analysis is concerned with the behavior
as $n \to \infty$ where the higher order term $5n^3$ dominates. Further, the coefficient 5 of $5n^3$
is not of interest since its value varies depending on the machine model. So we say that
the function is $O(n^3)$. The big O notation applies to functions on the positive integers
taking on positive real values.


Definition 12.1 For functions $f$ and $g$ from the natural numbers to the positive reals,
$f(n)$ is $O(g(n))$ if there exists a constant $c > 0$ such that for all $n$, $f(n) \le cg(n)$. ∎


Thus, $f(n) = 5n^3 + 25n^2 \ln n - 6n + 22$ is $O(n^3)$. The upper bound need not be tight.
For example, in this case $f(n)$ is also $O(n^4)$. Note in our definition we require $g(n)$ to be
strictly greater than 0 for all $n$.
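
The definition can be explored numerically. The sketch below checks a candidate constant for the running-time example over a finite range; the witness constant `c = 22` and the range tested are assumptions chosen for illustration, and a finite check of this kind is evidence for the bound, not a proof of it.

```python
import math


def f(n):
    # The running-time example from the text.
    return 5 * n**3 + 25 * n**2 * math.log(n) - 6 * n + 22


c = 22  # assumed witness constant for the O(n^3) bound
bound_holds = all(f(n) <= c * n**3 for n in range(1, 10001))

# The ratio f(n) / n^3 approaches the leading coefficient 5 as n grows.
ratio_at_large_n = f(10**6) / (10**6) ** 3
```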


To say that the function $f(n)$ grows at least as fast as $g(n)$, one uses a notation Omega.
For positive real valued $f$ and $g$, $f(n)$ is $\Omega(g(n))$ if there exists a constant $c > 0$ such that
for all $n$, $f(n) \ge cg(n)$. If $f(n)$ is both $O(g(n))$ and $\Omega(g(n))$, then $f(n)$ is $\Theta(g(n))$. Theta
is used when the two functions have the same asymptotic growth rate.


Many times one wishes to bound the low order terms. To do this, a notation called
little o is used. We say $f(n)$ is $o(g(n))$ if $\lim_{n \to \infty} \frac{f(n)}{g(n)} = 0$. Note that $f(n)$ being $O(g(n))$
means that asymptotically $f(n)$ does not grow faster than $g(n)$, whereas $f(n)$ being
$o(g(n))$ means that asymptotically $f(n)/g(n)$ goes to zero. If $f(n) = 2n + \sqrt{n}$, then $f(n)$
is $O(n)$ but in bounding the lower order term, we write $f(n) = 2n + o(n)$. Finally, we
write $f(n) \sim g(n)$ if $\lim_{n \to \infty} \frac{f(n)}{g(n)} = 1$ and say $f(n)$ is $\omega(g(n))$ if $\lim_{n \to \infty} \frac{f(n)}{g(n)} = \infty$. The difference
between $f(n)$ being $\Theta(g(n))$ and $f(n) \sim g(n)$ is that in the first case $f(n)$ and $g(n)$ may
differ by a multiplicative constant factor. We also note here that formally, $O(g(n))$ is a
set of functions, namely the set of functions $f$ such that $f(n)$ is $O(g(n))$; that is, “$f(n)$ is
$O(g(n))$” formally means that $f(n) \in O(g(n))$.


asymptotic upper bound: $f(n)$ is $O(g(n))$ if for all $n$, $f(n) \le cg(n)$ for some constant $c > 0$. ($\le$)
asymptotic lower bound: $f(n)$ is $\Omega(g(n))$ if for all $n$, $f(n) \ge cg(n)$ for some constant $c > 0$. ($\ge$)
asymptotic equality: $f(n)$ is $\Theta(g(n))$ if it is both $O(g(n))$ and $\Omega(g(n))$. ($=$)
$f(n)$ is $o(g(n))$ if $\lim_{n \to \infty} \frac{f(n)}{g(n)} = 0$. ($<$)
$f(n) \sim g(n)$ if $\lim_{n \to \infty} \frac{f(n)}{g(n)} = 1$. ($=$)
$f(n)$ is $\omega(g(n))$ if $\lim_{n \to \infty} \frac{f(n)}{g(n)} = \infty$. ($>$)


12.2 Useful Relations


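Summations


$$\sum_{i=0}^{n} a^i = 1 + a + a^2 + \cdots + a^n = \frac{1 - a^{n+1}}{1 - a}, \quad a \ne 1$$

$$\sum_{i=0}^{\infty} a^i = 1 + a + a^2 + \cdots = \frac{1}{1 - a}, \quad |a| < 1$$

$$\sum_{i=0}^{\infty} i a^i = a + 2a^2 + 3a^3 + \cdots = \frac{a}{(1 - a)^2}, \quad |a| < 1$$

$$\sum_{i=0}^{\infty} i^2 a^i = a + 4a^2 + 9a^3 + \cdots = \frac{a(1 + a)}{(1 - a)^3}, \quad |a| < 1$$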

7. ld 
2A 

















la| <1 








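We prove one equality:

$$\sum_{i=0}^{\infty} i a^i = a + 2a^2 + 3a^3 + \cdots = \frac{a}{(1 - a)^2}, \quad \text{provided } |a| < 1.$$

Proof: Write $S = \sum_{i=0}^{\infty} i a^i$. So,

$$aS = \sum_{i=0}^{\infty} i a^{i+1} = \sum_{i=1}^{\infty} (i - 1) a^i.$$

Thus,

$$S - aS = \sum_{i=1}^{\infty} i a^i - \sum_{i=1}^{\infty} (i - 1) a^i = \sum_{i=1}^{\infty} a^i = \frac{a}{1 - a},$$

from which the equality follows. The sum $\sum_i i^2 a^i$ can also be done by an extension of this
method (left to the reader). Using generating functions, we will see another proof of both
these equalities by derivatives.

$$\sum_{i=1}^{\infty} \frac{1}{i} = 1 + \left(\tfrac{1}{2} + \tfrac{1}{3}\right) + \left(\tfrac{1}{4} + \tfrac{1}{5} + \tfrac{1}{6} + \tfrac{1}{7}\right) + \cdots \ge 1 + \tfrac{1}{2} + \tfrac{1}{2} + \cdots$$

and thus diverges.

The summation $\sum_{i=1}^{n} \frac{1}{i}$ grows as $\ln n$ since $\sum_{i=1}^{n} \frac{1}{i} \approx \int_{x=1}^{n} \frac{1}{x}\,dx$. In fact,
$\lim_{n \to \infty} \left( \sum_{i=1}^{n} \frac{1}{i} - \ln(n) \right) = \gamma$ where $\gamma \cong 0.5772$ is Euler’s constant. Thus
$\sum_{i=1}^{n} \frac{1}{i} \cong \ln(n) + \gamma$ for large $n$.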
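
These closed forms and the harmonic-sum estimate are easy to sanity-check numerically. The sketch below is illustrative only; the helper names, the choice $a = 0.5$, the number of terms, and $n = 10^5$ are assumptions made for the example.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler's constant, rounded


def weighted_geometric_sums(a, terms=2000):
    """Partial sums of sum_i i*a^i and sum_i i^2*a^i (assumes |a| < 1)."""
    s1 = sum(i * a**i for i in range(terms))
    s2 = sum(i * i * a**i for i in range(terms))
    return s1, s2


def harmonic(n):
    """The n-th harmonic number 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))


a = 0.5
s1, s2 = weighted_geometric_sums(a)
closed1 = a / (1 - a) ** 2            # closed form for sum_i i*a^i
closed2 = a * (1 + a) / (1 - a) ** 3  # closed form for sum_i i^2*a^i

# The gap between the harmonic sum and ln(n) should be close to Euler's constant.
gap = harmonic(10**5) - math.log(10**5)
```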
Truncated Taylor series


If all the derivatives of a function $f(x)$ exist, then we can write

$$f(x) = f(0) + f'(0)x + f''(0)\frac{x^2}{2} + \cdots.$$

The series can be truncated. In fact, there exists some $y$ between 0 and $x$ such that

$$f(x) = f(0) + f'(y)x.$$

Also, there exists some $z$ between 0 and $x$ such that

$$f(x) = f(0) + f'(0)x + f''(z)\frac{x^2}{2}$$

and so on for higher derivatives. This can be used to derive inequalities. For example, if
$f(x) = \ln(1 + x)$, then its derivatives are

$$f'(x) = \frac{1}{1 + x}; \qquad f''(x) = -\frac{1}{(1 + x)^2}; \qquad f'''(x) = \frac{2}{(1 + x)^3}.$$

For any $x > -1$, $f''(x) < 0$ and thus $f(x) \le f(0) + f'(0)x$, hence $\ln(1 + x) \le x$, which also
follows from the inequality $1 + x \le e^x$. Also using

$$f(x) = f(0) + f'(0)x + f''(0)\frac{x^2}{2} + f'''(z)\frac{x^3}{3!}$$

for $x > -1$, $f'''(x) > 0$, and so for $x > 0$, where the last term is positive,

$$\ln(1 + x) > x - \frac{x^2}{2}.$$


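Exponentials and logs


$$a^{\log n} = n^{\log a}$$

$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots \qquad e \approx 2.718 \qquad \frac{1}{e} \approx 0.3679.$$

Setting $x = 1$ in the equation $e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots$ yields $e = \sum_{i=0}^{\infty} \frac{1}{i!}$.

$$\lim_{n \to \infty} \left( 1 + \frac{a}{n} \right)^n = e^a$$

$$\ln(1 + x) = x - \frac{1}{2}x^2 + \frac{1}{3}x^3 - \frac{1}{4}x^4 \cdots \qquad |x| < 1$$

The above expression with $-x$ substituted for $x$ gives rise to the approximations

$$\ln(1 - x) < -x$$

which also follows from $1 - x \le e^{-x}$, since $\ln(1 - x)$ is a monotone function for $x \in (0, 1)$.


For $0 < x < 0.68$, $\ln(1 - x) > -x - x^2$.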


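Trigonometric identities


$\sin(x \pm y) = \sin(x)\cos(y) \pm \cos(x)\sin(y)$
$\cos(x \pm y) = \cos(x)\cos(y) \mp \sin(x)\sin(y)$
$\cos(2\theta) = \cos^2\theta - \sin^2\theta = 1 - 2\sin^2\theta$
$\sin(2\theta) = 2\sin\theta\cos\theta$
$\sin^2\frac{\theta}{2} = \frac{1}{2}(1 - \cos\theta)$
$\cos^2\frac{\theta}{2} = \frac{1}{2}(1 + \cos\theta)$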

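Gaussian and related integrals


$$\int x e^{ax^2}\,dx = \frac{1}{2a} e^{ax^2}$$

$$\int \frac{1}{a^2 + x^2}\,dx = \frac{1}{a} \tan^{-1}\frac{x}{a} \quad \text{thus} \quad \int_{-\infty}^{\infty} \frac{1}{a^2 + x^2}\,dx = \frac{\pi}{a}$$

$$\int_{-\infty}^{\infty} e^{-\frac{a^2 x^2}{2}}\,dx = \frac{\sqrt{2\pi}}{a} \quad \text{thus} \quad \frac{a}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{a^2 x^2}{2}}\,dx = 1$$

$$\int_{0}^{\infty} x^2 e^{-ax^2}\,dx = \frac{1}{4a} \sqrt{\frac{\pi}{a}}$$

$$\int_{0}^{\infty} x^{2n} e^{-\frac{x^2}{a^2}}\,dx = \sqrt{\pi}\,\frac{1 \cdot 3 \cdot 5 \cdots (2n - 1)}{2^{n+1}}\,a^{2n+1} = \sqrt{\pi}\,\frac{(2n)!}{n!} \left( \frac{a}{2} \right)^{2n+1}$$

$$\int_{0}^{\infty} x^{2n+1} e^{-\frac{x^2}{a^2}}\,dx = \frac{n!}{2}\,a^{2n+2}$$

$$\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi}$$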
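To verify $\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi}$, consider $\left( \int_{-\infty}^{\infty} e^{-x^2}\,dx \right)^2 = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-(x^2 + y^2)}\,dx\,dy$. Let
$x = r\cos\theta$ and $y = r\sin\theta$. The Jacobian of this transformation of variables is

$$J(r, \theta) = \begin{vmatrix} \frac{\partial x}{\partial r} & \frac{\partial x}{\partial \theta} \\ \frac{\partial y}{\partial r} & \frac{\partial y}{\partial \theta} \end{vmatrix} = \begin{vmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{vmatrix} = r.$$

Thus,

$$
\begin{aligned}
\left( \int_{-\infty}^{\infty} e^{-x^2}\,dx \right)^2
 &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-(x^2 + y^2)}\,dx\,dy = \int_{0}^{\infty} \int_{0}^{2\pi} e^{-r^2} J(r, \theta)\,d\theta\,dr \\
 &= \int_{0}^{\infty} e^{-r^2} r\,dr \int_{0}^{2\pi} d\theta = 2\pi \left[ -\frac{e^{-r^2}}{2} \right]_{0}^{\infty} = \pi.
\end{aligned}
$$

Thus, $\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi}$.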


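The integral $\int_{1}^{\infty} x^r\,dx$ converges if $r \le -1 - \epsilon$ and diverges if $r \ge -1 + \epsilon$:

$$\int_{1}^{\infty} x^r\,dx = \left. \frac{1}{r + 1} x^{r+1} \right|_{1}^{\infty} = \begin{cases} \frac{1}{\epsilon} & r = -1 - \epsilon \\ \infty & r = -1 + \epsilon \end{cases}$$

Thus $\sum_{i=1}^{\infty} \frac{1}{i^{1+\epsilon}}$ converges since $\sum_{i=2}^{\infty} \frac{1}{i^{1+\epsilon}} < \int_{1}^{\infty} x^{-1-\epsilon}\,dx$, and $\sum_{i=1}^{\infty} \frac{1}{i^{1-\epsilon}}$ diverges since
$\sum_{i=1}^{\infty} \frac{1}{i^{1-\epsilon}} > \int_{1}^{\infty} x^{-1+\epsilon}\,dx$.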

Miscellaneous integrals


$$\int_{x=0}^{1} x^{\alpha - 1} (1 - x)^{\beta - 1}\,dx = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha + \beta)}$$


For definition of the Gamma function, $\Gamma(x)$, see Section 12.3.
Binomial coefficients


The binomial coefficient $\binom{n}{k} = \frac{n!}{(n-k)!\,k!}$ is the number of ways of choosing $k$ items from $n$.
When choosing $d + 1$ items from a set of $n + 1$ items for $d < n$ one can either choose the
first item and then choose $d$ out of the remaining $n$ or not choose the first item and then
choose $d + 1$ out of the remaining $n$. Therefore, for $d < n$,

$$\binom{n}{d} + \binom{n}{d+1} = \binom{n+1}{d+1}.$$

The observation that the number of ways of choosing $k$ items from $2n$ equals the
number of ways of choosing $i$ items from the first $n$ and choosing $k - i$ items from the
second $n$ summed over all $i$, $0 \le i \le k$, yields the identity

$$\sum_{i=0}^{k} \binom{n}{i} \binom{n}{k-i} = \binom{2n}{k}.$$

Setting $k = n$ in the above formula and observing that $\binom{n}{i} = \binom{n}{n-i}$ yields

$$\sum_{i=0}^{n} \binom{n}{i}^2 = \binom{2n}{n}.$$

More generally $\sum_{i=0}^{k} \binom{n}{i} \binom{m}{k-i} = \binom{n+m}{k}$ by a similar derivation.
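
The binomial identities above can be spot-checked with exact integer arithmetic. In the sketch below the helper name `vandermonde_lhs` and the particular values tested are assumptions chosen for illustration; `math.comb` returns 0 when asked to choose more items than are available, which is what the sum needs.

```python
from math import comb


def vandermonde_lhs(n, m, k):
    """Left side of sum_i C(n, i) * C(m, k - i) for i = 0, ..., k."""
    return sum(comb(n, i) * comb(m, k - i) for i in range(k + 1))


pascal_ok = comb(6, 2) + comb(6, 3) == comb(7, 3)
squares_ok = sum(comb(6, i) ** 2 for i in range(7)) == comb(12, 6)
general_ok = vandermonde_lhs(5, 7, 4) == comb(12, 4)
```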


12.3 Useful Inequalities

$1 + x \le e^x$ for all real $x$.

One often establishes an inequality such as $1 + x \le e^x$ by showing that the difference
of the two sides, namely $e^x - (1 + x)$, is always non-negative. This can be done
by taking derivatives. The first and second derivatives are $e^x - 1$ and $e^x$. Since $e^x$
is always positive, $e^x - 1$ is monotonic and $e^x - (1 + x)$ is convex. Since $e^x - 1$ is
monotonic, it can be zero only once and is zero at $x = 0$. Thus, $e^x - (1 + x)$ takes
on its minimum at $x = 0$ where it is zero, establishing the inequality.


$(1 - x)^n \ge 1 - nx$ for $0 \le x \le 1$

Let $g(x) = (1 - x)^n - (1 - nx)$. We establish $g(x) \ge 0$ for $x$ in $[0, 1]$ by taking
the derivative.

$$g'(x) = -n(1 - x)^{n-1} + n = n\left( 1 - (1 - x)^{n-1} \right) \ge 0$$

for $0 \le x \le 1$. Thus, $g$ takes on its minimum for $x$ in $[0, 1]$ at $x = 0$ where $g(0) = 0$,
proving the inequality.


$1 + x \le e^x$ for all real $x$
$(1 - x)^n \ge 1 - nx$ for $0 \le x \le 1$
$(x + y)^2 \le 2x^2 + 2y^2$
Triangle Inequality $|x + y| \le |x| + |y|$.
Cauchy-Schwarz Inequality $|x||y| \ge x^T y$
Young's Inequality For positive real numbers $p$ and $q$ where $\frac{1}{p} + \frac{1}{q} = 1$ and
positive reals $x$ and $y$, $xy \le \frac{1}{p} x^p + \frac{1}{q} y^q$.
Hölder’s inequality For positive real numbers $p$ and $q$ with $\frac{1}{p} + \frac{1}{q} = 1$,
$\sum_{i=1}^{n} |x_i y_i| \le \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p} \left( \sum_{i=1}^{n} |y_i|^q \right)^{1/q}$.
Jensen’s inequality For a convex function $f$, for $\alpha_1 + \cdots + \alpha_n = 1$, $\alpha_i \ge 0$,
$f\left( \sum_{i=1}^{n} \alpha_i x_i \right) \le \sum_{i=1}^{n} \alpha_i f(x_i)$.


$(x + y)^2 \le 2x^2 + 2y^2$

The inequality follows from $(x + y)^2 + (x - y)^2 = 2x^2 + 2y^2$.
Lemma 12.1 For any non-negative reals $a_1, a_2, \ldots, a_n$ and any $\rho \in [0, 1]$,
$\left( \sum_{i=1}^{n} a_i \right)^{\rho} \le \sum_{i=1}^{n} a_i^{\rho}$.


Proof: We will see that we can reduce the proof of the lemma to the case when only
one of the $a_i$ is non-zero and the rest are zero. To this end, suppose $a_1$ and $a_2$ are both
positive and without loss of generality, assume $a_1 \ge a_2$. Add an infinitesimal positive
amount $\epsilon$ to $a_1$ and subtract the same amount from $a_2$. This does not alter the left hand
side. We claim it does not increase the right hand side. To see this, note that

$$(a_1 + \epsilon)^{\rho} + (a_2 - \epsilon)^{\rho} - a_1^{\rho} - a_2^{\rho} = \rho \left( a_1^{\rho - 1} - a_2^{\rho - 1} \right) \epsilon + O(\epsilon^2),$$

and since $\rho - 1 \le 0$, we have $a_1^{\rho - 1} - a_2^{\rho - 1} \le 0$, proving the claim. Now by repeating this
process, we can make $a_2 = 0$ (at that time $a_1$ will equal the sum of the original $a_1$ and
$a_2$). Now repeating on all pairs of $a_i$, we can make all but one of them zero and in the
process, the left hand side remains the same and the right hand side has not increased.
So it suffices to prove the inequality at the end which clearly holds. This method of proof
is called the variational method. ∎


The Triangle Inequality

For any two vectors $x$ and $y$, $|x + y| \le |x| + |y|$. This can be seen by viewing $x$
and $y$ as two sides of a triangle; equality holds iff the angle $\alpha$ between $x$ and $y$ is 180
degrees. Formally, by the law of cosines we have $|x + y|^2 = |x|^2 + |y|^2 - 2|x||y|\cos(\alpha) \le
|x|^2 + |y|^2 + 2|x||y| = (|x| + |y|)^2$. The inequality follows by taking square roots.
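Stirling approximation


$$n! \cong \left( \frac{n}{e} \right)^n \sqrt{2\pi n} \qquad \binom{2n}{n} \cong \frac{1}{\sqrt{\pi n}}\,2^{2n}$$

$$\sqrt{2\pi n}\,\frac{n^n}{e^n} < n! < \sqrt{2\pi n}\,\frac{n^n}{e^n} \left( 1 + \frac{1}{12n - 1} \right)$$


We prove the inequalities, except for constant factors. Namely, we prove that

$$1.4 \left( \frac{n}{e} \right)^n \sqrt{n} \le n! \le e \left( \frac{n}{e} \right)^n \sqrt{n}.$$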





Write $\ln(n!) = \ln 1 + \ln 2 + \cdots + \ln n$. This sum is approximately $\int_{x=1}^{n} \ln x\,dx$. The
indefinite integral $\int \ln x\,dx = (x \ln x - x)$ gives an approximation, but without the $\sqrt{n}$
term. To get the $\sqrt{n}$, differentiate twice and note that $\ln x$ is a concave function. This
means that for any positive $x_0$,

$$\frac{\ln x_0 + \ln(x_0 + 1)}{2} \le \int_{x=x_0}^{x_0 + 1} \ln x\,dx,$$

since for $x \in [x_0, x_0 + 1]$, the curve $\ln x$ is always above the spline joining $(x_0, \ln x_0)$ and
$(x_0 + 1, \ln(x_0 + 1))$. Thus,

$$
\begin{aligned}
\ln(n!) &= \frac{\ln 1}{2} + \frac{\ln 1 + \ln 2}{2} + \frac{\ln 2 + \ln 3}{2} + \cdots + \frac{\ln(n - 1) + \ln n}{2} + \frac{\ln n}{2} \\
 &\le \int_{x=1}^{n} \ln x\,dx + \frac{\ln n}{2} = \left[ x \ln x - x \right]_{1}^{n} + \frac{\ln n}{2} \\
 &= n \ln n - n + 1 + \frac{\ln n}{2}.
\end{aligned}
$$

Thus, $n! \le n^n e^{-n} \sqrt{n}\,e$. For the lower bound on $n!$, start with the fact that for any
$x_0 \ge 1/2$ and any real $\rho$

$$\ln x_0 \ge \frac{1}{2} \left( \ln(x_0 + \rho) + \ln(x_0 - \rho) \right) \quad \text{implies} \quad \ln x_0 \ge \int_{x = x_0 - 0.5}^{x_0 + 0.5} \ln x\,dx.$$

Thus,

$$\ln(n!) = \ln 2 + \ln 3 + \cdots + \ln n \ge \int_{x=1.5}^{n + 0.5} \ln x\,dx,$$

from which one can derive a lower bound with a calculation.
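
The two-sided bound $1.4 (n/e)^n \sqrt{n} \le n! \le e (n/e)^n \sqrt{n}$ can be checked numerically in logarithmic form, which avoids overflow for large $n$. The sketch below is illustrative; the function name, the small rounding slack `eps`, and the range tested are assumptions made for this example.

```python
import math


def stirling_bounds_hold(n, eps=1e-9):
    """Check ln(1.4) + core <= ln(n!) <= 1 + core, where core = ln((n/e)^n * sqrt(n))."""
    log_fact = math.lgamma(n + 1)  # ln(n!)
    core = n * math.log(n) - n + 0.5 * math.log(n)
    # eps absorbs floating-point rounding; at n = 1 the upper bound holds with equality.
    return math.log(1.4) + core <= log_fact + eps and log_fact <= 1 + core + eps


all_ok = all(stirling_bounds_hold(n) for n in range(1, 1001))
```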
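Stirling approximation for the binomial coefficient


$$\binom{n}{k} \le \left( \frac{en}{k} \right)^k$$

Using the Stirling approximation for $k!$,

$$\binom{n}{k} = \frac{n!}{(n - k)!\,k!} \le \frac{n^k}{k!} \cong \left( \frac{en}{k} \right)^k.$$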





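The Gamma function


For $a > 0$

$$\Gamma(a) = \int_{0}^{\infty} x^{a-1} e^{-x}\,dx$$

$\Gamma\left(\frac{1}{2}\right) = \sqrt{\pi}$, $\Gamma(1) = \Gamma(2) = 1$, and for $a \ge 2$, $\Gamma(a) = (a - 1)\Gamma(a - 1)$.

To prove $\Gamma(a) = (a - 1)\Gamma(a - 1)$ use integration by parts.

$$\int f(x)\,g'(x)\,dx = f(x)\,g(x) - \int f'(x)\,g(x)\,dx$$

Write $\Gamma(a) = \int_{x=0}^{\infty} f(x)\,g'(x)\,dx$, where $f(x) = x^{a-1}$ and $g'(x) = e^{-x}$, so that $g(x) = -e^{-x}$. Thus,

$$\Gamma(a) = \left[ f(x)\,g(x) \right]_{x=0}^{\infty} + \int_{x=0}^{\infty} (a - 1)\,x^{a-2} e^{-x}\,dx = (a - 1)\Gamma(a - 1),$$

as claimed, since the boundary term vanishes for $a > 1$.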


Cauchy-Schwarz Inequality


$$\left( \sum_{i=1}^{n} x_i^2 \right) \left( \sum_{i=1}^{n} y_i^2 \right) \ge \left( \sum_{i=1}^{n} x_i y_i \right)^2$$


In vector form, $|x||y| \ge x^T y$, the inequality states that the dot product of two vectors
is at most the product of their lengths. The Cauchy-Schwarz inequality is a special case
of Hölder’s inequality with $p = q = 2$.
Young’s inequality

For positive real numbers $p$ and $q$ where $\frac{1}{p} + \frac{1}{q} = 1$ and positive reals $x$ and $y$,

$$\frac{1}{p} x^p + \frac{1}{q} y^q \ge xy.$$

The left hand side of Young’s inequality, $\frac{1}{p} x^p + \frac{1}{q} y^q$, is a convex combination of $x^p$ and $y^q$
since $\frac{1}{p}$ and $\frac{1}{q}$ sum to 1. $\ln(x)$ is a concave function for $x > 0$ and so the ln of the convex
combination of the two elements is greater than or equal to the convex combination of
the ln of the two elements

$$\ln\left( \frac{1}{p} x^p + \frac{1}{q} y^q \right) \ge \frac{1}{p} \ln(x^p) + \frac{1}{q} \ln(y^q) = \ln(xy).$$

Since for $x > 0$, $\ln x$ is a monotone increasing function, $\frac{1}{p} x^p + \frac{1}{q} y^q \ge xy$.
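Hölder’s inequality


For positive real numbers $p$ and $q$ with $\frac{1}{p} + \frac{1}{q} = 1$,

$$\sum_{i=1}^{n} |x_i y_i| \le \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p} \left( \sum_{i=1}^{n} |y_i|^q \right)^{1/q}.$$

Let $x_i' = x_i / \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$ and $y_i' = y_i / \left( \sum_{i=1}^{n} |y_i|^q \right)^{1/q}$. Replacing $x_i$ by $x_i'$ and $y_i$ by
$y_i'$ does not change the inequality. Now $\sum_{i=1}^{n} |x_i'|^p = \sum_{i=1}^{n} |y_i'|^q = 1$, so it suffices to prove
$\sum_{i=1}^{n} |x_i' y_i'| \le 1$. Apply Young’s inequality to get $|x_i' y_i'| \le \frac{|x_i'|^p}{p} + \frac{|y_i'|^q}{q}$. Summing over $i$, the
right hand side sums to $\frac{1}{p} + \frac{1}{q} = 1$, finishing the proof.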


For $a_1, a_2, \ldots, a_n$ real and $k$ a positive integer,

$$(a_1 + a_2 + \cdots + a_n)^k \le n^{k-1} \left( |a_1|^k + |a_2|^k + \cdots + |a_n|^k \right).$$

Using Hölder’s inequality with $p = k$ and $q = k/(k - 1)$,

$$
\begin{aligned}
|a_1 + a_2 + \cdots + a_n| &\le |a_1 \cdot 1| + |a_2 \cdot 1| + \cdots + |a_n \cdot 1| \\
 &\le \left( \sum_{i=1}^{n} |a_i|^k \right)^{1/k} (1 + 1 + \cdots + 1)^{(k-1)/k},
\end{aligned}
$$

from which the current inequality follows.


Figure 12.1: Approximating sums by integrals


Arithmetic and geometric means


The arithmetic mean of a set of non-negative reals is at least their geometric mean.
For $a_1, a_2, \ldots, a_n > 0$,

$$\frac{1}{n} \sum_{i=1}^{n} a_i \ge \sqrt[n]{a_1 a_2 \cdots a_n}.$$


Assume that $a_1 \ge a_2 \ge \ldots \ge a_n$. We reduce the proof to the case when all the $a_i$ are equal
using the variational method. In this case the inequality holds with equality. Suppose
$a_1 > a_2$. Let $\epsilon$ be a positive infinitesimal. Add $\epsilon$ to $a_2$ and subtract $\epsilon$ from $a_1$ to get closer
to the case when they are equal. The left hand side $\frac{1}{n} \sum_{i=1}^{n} a_i$ does not change.

$$
\begin{aligned}
(a_1 - \epsilon)(a_2 + \epsilon)\,a_3 a_4 \cdots a_n &= a_1 a_2 \cdots a_n + \epsilon (a_1 - a_2)\,a_3 a_4 \cdots a_n + O(\epsilon^2) \\
 &> a_1 a_2 \cdots a_n
\end{aligned}
$$

for small enough $\epsilon > 0$. Thus, the change has increased $\sqrt[n]{a_1 a_2 \cdots a_n}$. So if the inequality
holds after the change, it must hold before. By continuing this process, one can make all
the $a_i$ equal.


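Approximating sums by integrals


For monotonic decreasing $f(x)$,

$$\int_{x=m}^{n+1} f(x)\,dx \le \sum_{i=m}^{n} f(i) \le \int_{x=m-1}^{n} f(x)\,dx.$$

See Fig. 12.1. Thus,

$$\int_{x=2}^{n+1} \frac{1}{x^2}\,dx \le \sum_{i=2}^{n} \frac{1}{i^2} = \frac{1}{4} + \frac{1}{9} + \cdots + \frac{1}{n^2} \le \int_{x=1}^{n} \frac{1}{x^2}\,dx$$

and hence $\frac{3}{2} - \frac{1}{n+1} \le \sum_{i=1}^{n} \frac{1}{i^2} \le 2 - \frac{1}{n}$.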
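
The resulting bounds $\frac{3}{2} - \frac{1}{n+1} \le \sum_{i=1}^{n} \frac{1}{i^2} \le 2 - \frac{1}{n}$ can be confirmed for small $n$ with a few lines of code. The helper names, the rounding slack `eps`, and the range checked below are assumptions made for illustration.

```python
def inverse_square_sum(n):
    """Sum of 1/i^2 for i = 1, ..., n."""
    return sum(1.0 / (i * i) for i in range(1, n + 1))


def bounds_hold(n, eps=1e-12):
    """Integral bounds on the sum; eps absorbs rounding (both bounds are tight at n = 1)."""
    s = inverse_square_sum(n)
    return 1.5 - 1.0 / (n + 1) <= s + eps and s <= 2.0 - 1.0 / n + eps


checked = all(bounds_hold(n) for n in range(1, 500))
```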


Jensen’s Inequality


For a convex function $f$,

$$f\left( \frac{1}{2}(x_1 + x_2) \right) \le \frac{1}{2} \left( f(x_1) + f(x_2) \right).$$

More generally for any convex function $f$,

$$f\left( \sum_{i=1}^{n} \alpha_i x_i \right) \le \sum_{i=1}^{n} \alpha_i f(x_i),$$

where $0 \le \alpha_i \le 1$ and $\sum_{i=1}^{n} \alpha_i = 1$. From this, it follows that for any convex function $f$ and
random variable $x$,

$$E(f(x)) \ge f(E(x)).$$

We prove this for a discrete random variable $x$ taking on values $a_1, a_2, \ldots$ with
$\text{Prob}(x = a_i) = \alpha_i$:

$$E(f(x)) = \sum_i \alpha_i f(a_i) \ge f\left( \sum_i \alpha_i a_i \right) = f(E(x)).$$


Figure 12.2: For a convex function $f$, $f\left( \frac{x_1 + x_2}{2} \right) \le \frac{1}{2} \left( f(x_1) + f(x_2) \right)$.


Example: Let $f(x) = x^k$ for $k$ an even positive integer. Then, $f''(x) = k(k - 1)x^{k-2}$
which since $k - 2$ is even is non-negative for all $x$ implying that $f$ is convex. Thus,

$$E(x) \le \sqrt[k]{E(x^k)},$$

since $t^{1/k}$ is a monotone function of $t$, $t > 0$. It is easy to see that this inequality does not
necessarily hold when $k$ is odd; indeed for odd $k$, $x^k$ is not a convex function. ∎

Tails of Gaussians


For bounding the tails of Gaussian densities, the following inequality is useful. The
proof uses a technique useful in many contexts. For $t > 0$,

$$\int_{x=t}^{\infty} e^{-x^2}\,dx \le \frac{e^{-t^2}}{2t}.$$

In proof, first write: $\int_{x=t}^{\infty} e^{-x^2}\,dx \le \int_{x=t}^{\infty} \frac{x}{t} e^{-x^2}\,dx$, using the fact that $x \ge t$ in the range of
integration. The latter expression is integrable in closed form since $d(e^{-x^2}) = (-2x)e^{-x^2}\,dx$,
yielding the claimed bound.
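
The tail bound can be compared with the exact tail, which is expressible through the complementary error function: $\int_t^{\infty} e^{-x^2}\,dx = \frac{\sqrt{\pi}}{2}\,\mathrm{erfc}(t)$. The sketch below uses Python's standard `math.erfc`; the helper names and the grid of $t$ values are assumptions chosen for illustration.

```python
import math


def gaussian_tail(t):
    """Exact value of the integral of exp(-x^2) from t to infinity."""
    return 0.5 * math.sqrt(math.pi) * math.erfc(t)


def tail_bound(t):
    """The upper bound exp(-t^2) / (2t) from the text, valid for t > 0."""
    return math.exp(-t * t) / (2 * t)


ts = [0.1 * i for i in range(1, 51)]  # t = 0.1, 0.2, ..., 5.0
bound_ok = all(gaussian_tail(t) <= tail_bound(t) for t in ts)

# The bound becomes tight as t grows: this ratio approaches 1 from below.
ratio_at_5 = gaussian_tail(5.0) / tail_bound(5.0)
```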


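A similar technique yields an upper bound on

$$\int_{x=\beta}^{1} (1 - x^2)^{\alpha}\,dx,$$

for $\beta \in [0, 1]$ and $\alpha > 0$. Just use $(1 - x^2)^{\alpha} \le \frac{x}{\beta}(1 - x^2)^{\alpha}$ over the range and integrate the
last expression.

$$
\begin{aligned}
\int_{x=\beta}^{1} (1 - x^2)^{\alpha}\,dx &\le \int_{x=\beta}^{1} \frac{x}{\beta} (1 - x^2)^{\alpha}\,dx = \left. \frac{-1}{2\beta(\alpha + 1)} (1 - x^2)^{\alpha + 1} \right|_{x=\beta}^{1} \\
 &= \frac{(1 - \beta^2)^{\alpha + 1}}{2\beta(\alpha + 1)}
\end{aligned}
$$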


12.4 Probability 


Consider an experiment such as flipping a coin whose outcome is determined by chance.
To talk about the outcome of a particular experiment, we introduce the notion of a random
variable whose value is the outcome of the experiment. The set of possible outcomes
is called the sample space. If the sample space is finite, we can assign a probability of
occurrence to each outcome. In some situations where the sample space is infinite, we can
assign a probability of occurrence. The probability $p(i) = \frac{6}{\pi^2} \frac{1}{i^2}$ for $i$ an integer greater
than or equal to one is such an example. The function assigning the probabilities is called
a probability distribution function.


In many situations, a probability distribution function does not exist. For example,
for the uniform probability on the interval [0,1], the probability of any specific value is
zero. What we can do is define a probability density function $p(x)$ such that

$$\text{Prob}(a < x < b) = \int_{a}^{b} p(x)\,dx.$$

If $x$ is a continuous random variable for which a density function exists, then the cumulative
distribution function $f(a)$ is defined by

$$f(a) = \int_{-\infty}^{a} p(x)\,dx$$

which gives the probability that $x \le a$.


12.4.1 Sample Space, Events, and Independence 


There may be more than one relevant random variable in a situation. For example, if
one tosses $n$ coins, there are $n$ random variables, $x_1, x_2, \ldots, x_n$, taking on values 0 and 1,
a 1 for heads and a 0 for tails. The set of possible outcomes, the sample space, is $\{0, 1\}^n$.
An event is a subset of the sample space. The event of an odd number of heads consists
of all elements of $\{0, 1\}^n$ with an odd number of 1’s.


Let $A$ and $B$ be two events. The joint occurrence of the two events is denoted by
$(A \wedge B)$. The conditional probability of event $A$ given that event $B$ has occurred is denoted
by $\text{Prob}(A|B)$ and is given by

$$\text{Prob}(A|B) = \frac{\text{Prob}(A \wedge B)}{\text{Prob}(B)}.$$

Events $A$ and $B$ are independent if the occurrence of one event has no influence on the
probability of the other. That is, $\text{Prob}(A|B) = \text{Prob}(A)$ or equivalently, $\text{Prob}(A \wedge B) =
\text{Prob}(A)\text{Prob}(B)$. Two random variables $x$ and $y$ are independent if for every possible set
$A$ of values for $x$ and every possible set $B$ of values for $y$, the events $x$ in $A$ and $y$ in $B$
are independent.


A collection of $n$ random variables $x_1, x_2, \ldots, x_n$ is mutually independent if for all
possible sets $A_1, A_2, \ldots, A_n$ of values of $x_1, x_2, \ldots, x_n$,

$$\text{Prob}(x_1 \in A_1, x_2 \in A_2, \ldots, x_n \in A_n) = \text{Prob}(x_1 \in A_1)\,\text{Prob}(x_2 \in A_2) \cdots \text{Prob}(x_n \in A_n).$$

If the random variables are discrete, it would suffice to say that for any values $a_1, a_2, \ldots, a_n$

$$\text{Prob}(x_1 = a_1, x_2 = a_2, \ldots, x_n = a_n) = \text{Prob}(x_1 = a_1)\,\text{Prob}(x_2 = a_2) \cdots \text{Prob}(x_n = a_n).$$

Random variables $x_1, x_2, \ldots, x_n$ are pairwise independent if for any $a_i$ and $a_j$, $i \ne j$,
$\text{Prob}(x_i = a_i, x_j = a_j) = \text{Prob}(x_i = a_i)\,\text{Prob}(x_j = a_j)$. An example of random variables
$x_1, x_2, x_3$ that are pairwise independent but not mutually independent would be if $x_1$ and
$x_2$ are the outcomes of independent fair coin flips, and $x_3$ is a $\{0, 1\}$-valued random variable
that equals 1 when $x_1 = x_2$.
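
The coin-flip example can be verified by enumerating its four equally likely outcomes with exact fractions. The sketch below is illustrative; the helper names `prob` and `pairwise_independent` are assumptions made for this example.

```python
from fractions import Fraction
from itertools import product

# Outcomes (x1, x2, x3): x1 and x2 are fair coin flips, x3 = 1 exactly when x1 == x2.
outcomes = [(x1, x2, int(x1 == x2)) for x1, x2 in product((0, 1), repeat=2)]
p = Fraction(1, len(outcomes))  # each outcome has probability 1/4


def prob(event):
    """Probability of the set of outcomes satisfying the predicate `event`."""
    return sum(p for w in outcomes if event(w))


def pairwise_independent():
    """Check Prob(xi = a, xj = b) = Prob(xi = a) * Prob(xj = b) for every pair."""
    for i, j in ((0, 1), (0, 2), (1, 2)):
        for a, b in product((0, 1), repeat=2):
            joint = prob(lambda w: w[i] == a and w[j] == b)
            if joint != prob(lambda w: w[i] == a) * prob(lambda w: w[j] == b):
                return False
    return True


# Mutual independence fails: the joint probability of all ones is not the product.
joint_all_ones = prob(lambda w: w == (1, 1, 1))
product_all_ones = (prob(lambda w: w[0] == 1)
                    * prob(lambda w: w[1] == 1)
                    * prob(lambda w: w[2] == 1))
```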





If $(x, y)$ is a random vector and one normalizes it to a unit vector $\left( \frac{x}{\sqrt{x^2 + y^2}}, \frac{y}{\sqrt{x^2 + y^2}} \right)$,
the coordinates are no longer independent since knowing the value of one coordinate
uniquely determines the absolute value of the other.

12.4.2 Linearity of Expectation


An important concept is that of the expectation of a random variable. The expected
value, $E(x)$, of a random variable $x$ is $E(x) = \sum_x x\,p(x)$ in the discrete case and $E(x) =
\int_{-\infty}^{\infty} x\,p(x)\,dx$ in the continuous case. The expectation of a sum of random variables is equal
to the sum of their expectations. The linearity of expectation follows directly from the
definition and does not require independence.


12.4.3 Union Bound


Let $A_1, A_2, \ldots, A_n$ be events. The actual probability of the union of events is given
by Boole’s formula.

$$\text{Prob}(A_1 \cup A_2 \cup \cdots \cup A_n) = \sum_{i=1}^{n} \text{Prob}(A_i) - \sum_{i<j} \text{Prob}(A_i \wedge A_j) + \sum_{i<j<k} \text{Prob}(A_i \wedge A_j \wedge A_k) - \cdots$$

Often we only need an upper bound on the probability of the union and use

$$\text{Prob}(A_1 \cup A_2 \cup \cdots \cup A_n) \le \sum_{i=1}^{n} \text{Prob}(A_i).$$

This upper bound is called the union bound.


12.4.4 Indicator Variables 


A useful tool is that of an indicator variable that takes on value 0 or 1 to indicate
whether some quantity is present or not. The indicator variable is useful in determining
the expected size of a subset. Given a random subset of the integers $\{1, 2, \ldots, n\}$, the
expected size of the subset is the expected value of $x_1 + x_2 + \cdots + x_n$ where $x_i$ is the
indicator variable that takes on value 1 if $i$ is in the subset.


Example: Consider a random permutation of $n$ integers. Define the indicator function
$x_i = 1$ if the $i^{th}$ integer in the permutation is $i$. The expected number of fixed points is
given by

$$E\left( \sum_{i=1}^{n} x_i \right) = \sum_{i=1}^{n} E(x_i) = n \cdot \frac{1}{n} = 1.$$

Note that the $x_i$ are not independent. But, linearity of expectation still applies. ∎
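
For small $n$ the expected number of fixed points can be computed exactly by listing every permutation. The sketch below does this with exact fractions; the function name and the range of $n$ are assumptions made for illustration, and the brute-force enumeration is only practical for small $n$.

```python
from fractions import Fraction
from itertools import permutations


def expected_fixed_points(n):
    """Average number of positions i with perm[i] == i over all n! permutations."""
    perms = list(permutations(range(n)))
    total = sum(sum(1 for i, v in enumerate(perm) if i == v) for perm in perms)
    return Fraction(total, len(perms))


all_equal_one = all(expected_fixed_points(n) == 1 for n in range(1, 7))
```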


Example: Consider the expected number of vertices of degree $d$ in a random graph
$G(n, p)$. The number of vertices of degree $d$ is the sum of $n$ indicator random variables, one
for each vertex, with value one if the vertex has degree $d$. The expectation is the sum of the
expectations of the $n$ indicator random variables and this is just $n$ times the expectation
of one of them. Since each vertex has $n - 1$ potential neighbors, the expected number of
degree $d$ vertices is $n \binom{n-1}{d} p^d (1 - p)^{n-1-d}$. ∎

12.4.5 Variance


In addition to the expected value of a random variable, another important parameter is the variance. The variance of a random variable x, denoted var(x) or often $\sigma^2(x)$, is $E\left((x - E(x))^2\right)$ and measures how close to the expected value the random variable is likely to be. The standard deviation $\sigma$ is the square root of the variance. The units of $\sigma$ are the same as those of x.


By linearity of expectation

$$\sigma^2 = E\left((x - E(x))^2\right) = E(x^2) - 2E(x)E(x) + E^2(x) = E(x^2) - E^2(x).$$


For the probability distribution $\text{Prob}(x = i) = \frac{6}{\pi^2}\frac{1}{i^2}$, $E(x) = \infty$. The probability distributions $\text{Prob}(x = i) = c\frac{1}{i^3}$ and $\text{Prob}(x_i = 2^i) = \frac{1}{4^i}$ have finite expectation but infinite variance.


12.4.6 Variance of the Sum of Independent Random Variables 


In general, the variance of the sum is not equal to the sum of the variances. However, if x and y are independent, then $E(xy) = E(x)E(y)$ and

$$\text{var}(x + y) = \text{var}(x) + \text{var}(y).$$

To see this

$$\text{var}(x + y) = E\left((x+y)^2\right) - E^2(x + y) = E(x^2) + 2E(xy) + E(y^2) - E^2(x) - 2E(x)E(y) - E^2(y).$$

From independence, $2E(xy) - 2E(x)E(y) = 0$ and

$$\text{var}(x + y) = E(x^2) - E^2(x) + E(y^2) - E^2(y) = \text{var}(x) + \text{var}(y).$$

More generally, if $x_1, x_2, \ldots, x_n$ are pairwise independent random variables, then

$$\text{var}(x_1 + x_2 + \cdots + x_n) = \text{var}(x_1) + \text{var}(x_2) + \cdots + \text{var}(x_n).$$

For the variance of the sum to be the sum of the variances only requires pairwise independence, not full independence.

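The remark that pairwise independence suffices can be illustrated with a standard small example that is not taken from the text: two fair bits and their exclusive-or. The three variables are pairwise independent but not mutually independent, since the third is determined by the first two. The sketch computes the variances exactly; the helper name `variance` is an assumption of this example.

```python
from fractions import Fraction
from itertools import product

def variance(values_with_prob):
    """Exact variance of a discrete distribution given as (value, probability) pairs."""
    mean = sum(p * v for v, p in values_with_prob)
    return sum(p * (v - mean) ** 2 for v, p in values_with_prob)

# Sample space: two fair bits a, b; the third variable is their XOR.
outcomes = [(a, b, a ^ b) for a, b in product((0, 1), repeat=2)]
prob = Fraction(1, len(outcomes))

var_each = [variance([(o[i], prob) for o in outcomes]) for i in range(3)]
var_sum = variance([(sum(o), prob) for o in outcomes])
print(var_each, var_sum)
```

The variance of the sum equals the sum of the three individual variances even though the variables are not fully independent.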
12.4.7 Median 


One often calculates the average value of a random variable to get a feeling for the magnitude of the variable. This is reasonable when the probability distribution of the variable is Gaussian, or has a small variance. However, if there are outliers, the average may be distorted by them. An alternative to calculating the expected value is to calculate the median, the value for which half of the probability is above and half is below.


12.4.8 The Central Limit Theorem


Let $s = x_1 + x_2 + \cdots + x_n$ be a sum of n independent random variables where each $x_i$ has probability distribution

$$x_i = \begin{cases} 0 & \text{with probability } 0.5 \\ 1 & \text{with probability } 0.5. \end{cases}$$

The expected value of each $x_i$ is 1/2 with variance

$$\sigma_i^2 = \left(\frac{1}{2} - 0\right)^2 \frac{1}{2} + \left(\frac{1}{2} - 1\right)^2 \frac{1}{2} = \frac{1}{4}.$$

The expected value of s is n/2 and since the variables are independent, the variance of the sum is the sum of the variances and hence is n/4. How concentrated s is around its mean depends on the standard deviation of s which is $\frac{\sqrt{n}}{2}$. For n equal 100 the expected value of s is 50 with a standard deviation of 5 which is 10% of the mean. For n = 10,000 the expected value of s is 5,000 with a standard deviation of 50 which is 1% of the mean. Note that as n increases, the standard deviation increases, but the ratio of the standard deviation to the mean goes to zero. More generally, if $x_i$ are independent and identically distributed, each with standard deviation $\sigma$, then the standard deviation of $x_1 + x_2 + \cdots + x_n$ is $\sqrt{n}\sigma$. So, $\frac{x_1 + x_2 + \cdots + x_n}{\sqrt{n}}$ has standard deviation $\sigma$. The central limit theorem makes a stronger assertion: after the mean is subtracted, $\frac{x_1 + x_2 + \cdots + x_n}{\sqrt{n}}$ has, in the limit, a Gaussian distribution with standard deviation $\sigma$.


Theorem 12.2 Suppose $x_1, x_2, \ldots, x_n$ is a sequence of identically distributed independent random variables, each with mean $\mu$ and variance $\sigma^2$. The distribution of the random variable

$$\frac{1}{\sqrt{n}}\left(x_1 + x_2 + \cdots + x_n - n\mu\right)$$

converges to the distribution of the Gaussian with mean 0 and variance $\sigma^2$.
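The fair-coin numbers quoted in this section can be reproduced exactly. The sketch below is only an illustration under the assumption that each $x_i$ is 0 or 1 with probability 1/2; the function names are hypothetical. It prints the mean, the standard deviation $\sqrt{n}/2$, their ratio, and the exact probability that the sum falls within one standard deviation of its mean, which a Gaussian would put near 0.68 (the discrete sum gives a somewhat larger value for small n because the endpoints are included).

```python
import math

def fair_coin_pmf(n, k):
    """Exact Prob(s = k) for s a sum of n fair 0/1 variables (integer arithmetic avoids overflow)."""
    return math.comb(n, k) / 2**n

def mass_within_one_std(n):
    """Exact Prob(|s - n/2| <= sqrt(n)/2)."""
    mean, std = n / 2, math.sqrt(n) / 2
    lo, hi = math.ceil(mean - std), math.floor(mean + std)
    return sum(fair_coin_pmf(n, k) for k in range(lo, hi + 1))

for n in (100, 10_000):
    mean, std = n / 2, math.sqrt(n) / 2
    print(n, mean, std, std / mean, mass_within_one_std(n))
```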


12.4.9 Probability Distributions 


The Gaussian or normal distribution


The normal distribution is

$$\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\frac{(x-m)^2}{\sigma^2}}$$

where m is the mean and $\sigma^2$ is the variance. The coefficient $\frac{1}{\sqrt{2\pi}\,\sigma}$ makes the integral of the distribution be one. If we measure distance in units of the standard deviation $\sigma$ from the mean, then the standard normal distribution $\phi(x)$ with mean zero and variance one is

$$\phi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2}.$$

Standard tables give values of the integral

$$\int_0^t \phi(x)\,dx$$

and from these values one can compute probability integrals for a normal distribution with mean m and variance $\sigma^2$.
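In place of printed tables, the standard normal integral can be evaluated with the error function from Python's standard library. The sketch below is one way to do this, not a prescribed method; note that it integrates from minus infinity rather than from 0 as the tables do, and the function names are assumptions of the example. Probabilities for a general normal distribution follow by measuring distance from the mean in units of $\sigma$.

```python
import math

def standard_normal_cdf(z):
    """Integral of the standard normal density from -infinity to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_prob_between(lo, hi, m, sigma):
    """Prob(lo <= x <= hi) for a normal x with mean m and standard deviation sigma."""
    return standard_normal_cdf((hi - m) / sigma) - standard_normal_cdf((lo - m) / sigma)

# Probability of landing within one standard deviation of the mean.
print(normal_prob_between(45, 55, m=50, sigma=5))
```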


General Gaussians


So far we have seen spherical Gaussian densities in $\mathbf{R}^d$. The word spherical indicates that the level curves of the density are spheres. If a random vector y in $\mathbf{R}^d$ has a spherical Gaussian density with zero mean, then $y_i$ and $y_j$, $i \ne j$, are independent. However, in many situations the variables are correlated. To model these Gaussians, level curves that are ellipsoids rather than spheres are used.


For a random vector x, the covariance of $x_i$ and $x_j$ is $E((x_i - \mu_i)(x_j - \mu_j))$. We list the covariances in a matrix called the covariance matrix, denoted $\Sigma$.⁴⁷ Since x and $\mu$ are column vectors, $(x - \mu)(x - \mu)^T$ is a $d \times d$ matrix. Expectation of a matrix or vector means componentwise expectation.

$$\Sigma = E\left((x - \mu)(x - \mu)^T\right).$$
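The general Gaussian density with mean $\mu$ and positive definite covariance matrix $\Sigma$ is

$$f(x) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma)}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right).$$

To compute the covariance matrix of the Gaussian, substitute $y = \Sigma^{-1/2}(x - \mu)$. Noting that a positive definite symmetric matrix has a square root:

$$E\left((x - \mu)(x - \mu)^T\right) = E\left(\Sigma^{1/2} y y^T \Sigma^{1/2}\right) = \Sigma^{1/2}\left(E(yy^T)\right)\Sigma^{1/2} = \Sigma.$$

The density of y is the unit variance, zero mean Gaussian, thus $E(yy^T) = I$.
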
Bernoulli trials and the binomial distribution


A Bernoulli trial has two possible outcomes, called success or failure, with probabilities p and 1 - p, respectively. If there are n independent Bernoulli trials, the probability of exactly k successes is given by the binomial distribution

$$B(n, p) = \binom{n}{k} p^k (1 - p)^{n-k}.$$

The mean and variance of the binomial distribution B(n, p) are np and np(1 - p), respectively. The mean of the binomial distribution is np, by linearity of expectations. The variance is np(1 - p) since the variance of a sum of independent random variables is the sum of their variances.

⁴⁷ $\Sigma$ is the standard notation for the covariance matrix. We will use it sparingly so as not to confuse with the summation sign.


Let $x_1$ be the number of successes in $n_1$ trials and let $x_2$ be the number of successes in $n_2$ trials. The probability distribution of the sum of the successes, $x_1 + x_2$, is the same as the distribution of $x_1 + x_2$ successes in $n_1 + n_2$ trials. Thus, $B(n_1, p) + B(n_2, p) = B(n_1 + n_2, p)$.


When p is a constant, the expected degree of vertices in G(n, p) increases with n. For example, in $G(n, \frac{1}{2})$, the expected degree of a vertex is (n - 1)/2. In many applications, we will be concerned with G(n, p) where p = d/n, for d a constant; i.e., graphs whose expected degree is a constant d independent of n. Holding d = np constant as n goes to infinity, the binomial distribution

$$\text{Prob}(k) = \binom{n}{k} p^k (1 - p)^{n-k}$$

approaches the Poisson distribution

$$\text{Prob}(k) = \frac{(np)^k}{k!} e^{-np} = \frac{d^k}{k!} e^{-d}.$$

To see this, assume k = o(n) and use the approximations $n - k \cong n$, $\binom{n}{k} \cong \frac{n^k}{k!}$, and $\left(1 - \frac{1}{n}\right)^{n-k} \cong e^{-1}$ to approximate the binomial distribution by

$$\lim_{n \to \infty} \binom{n}{k} p^k (1 - p)^{n-k} = \frac{n^k}{k!}\left(\frac{d}{n}\right)^k \left(1 - \frac{d}{n}\right)^n = \frac{d^k}{k!} e^{-d}.$$

Note that for $p = \frac{d}{n}$, where d is a constant independent of n, the probability of the binomial distribution falls off rapidly for k > d, and is essentially zero for all but some finite number of values of k. This justifies the k = o(n) assumption. Thus, the Poisson distribution is a good approximation.
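The convergence described above can be observed numerically. The following sketch holds d = np fixed (d = 3 is an arbitrary choice) and reports the largest pointwise gap between the binomial and Poisson probabilities over small k as n grows; the function names and the cutoff `k_max` are assumptions of the example.

```python
import math

def binomial_pmf(n, k, p):
    """Prob(k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(d, k):
    """Poisson probability of k events when the mean is d."""
    return math.exp(-d) * d**k / math.factorial(k)

def max_gap(n, d, k_max=10):
    """Largest |binomial - Poisson| over k = 0..k_max when p = d/n."""
    return max(abs(binomial_pmf(n, k, d / n) - poisson_pmf(d, k)) for k in range(k_max + 1))

d = 3.0
for n in (10, 100, 10_000):
    print(n, max_gap(n, d))
```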

Poisson distribution 


The Poisson distribution describes the probability of k events happening in a unit of time when the average rate per unit of time is $\lambda$. Divide the unit of time into n segments. When n is large enough, each segment is sufficiently small so that the probability of two events happening in the same segment is negligible. The Poisson distribution gives the probability of k events happening in a unit of time and can be derived from the binomial distribution by taking the limit as $n \to \infty$.

Let $p = \frac{\lambda}{n}$. Then

$$\begin{aligned}
\text{Prob}(k \text{ successes in a unit of time}) &= \lim_{n \to \infty} \binom{n}{k}\left(\frac{\lambda}{n}\right)^k\left(1 - \frac{\lambda}{n}\right)^{n-k} \\
&= \lim_{n \to \infty} \frac{n(n-1)\cdots(n-k+1)}{k!}\left(\frac{\lambda}{n}\right)^k\left(1 - \frac{\lambda}{n}\right)^{n}\left(1 - \frac{\lambda}{n}\right)^{-k} \\
&= \frac{\lambda^k}{k!} e^{-\lambda}.
\end{aligned}$$

In the limit as n goes to infinity the binomial distribution $p(k) = \binom{n}{k} p^k (1 - p)^{n-k}$ becomes the Poisson distribution $p(k) = e^{-\lambda}\frac{\lambda^k}{k!}$. The mean and the variance of the Poisson distribution have value $\lambda$. If x and y are independent Poisson random variables from distributions with means $\lambda_1$ and $\lambda_2$ respectively, then x + y is Poisson with mean $\lambda_1 + \lambda_2$. For large n and small p the binomial distribution can be approximated with the Poisson distribution.

The binomial distribution with mean np and variance np(1 - p) can be approximated by the normal distribution with mean np and variance np(1 - p). The central limit theorem tells us that there is such an approximation in the limit. The approximation is good if both np and n(1 - p) are greater than 10 provided k is not extreme. Thus,

$$\binom{n}{k}\left(\frac{1}{2}\right)^k\left(\frac{1}{2}\right)^{n-k} \cong \frac{1}{\sqrt{\pi n/2}}\, e^{-\frac{(n/2 - k)^2}{n/2}}.$$

This approximation is excellent provided k is $\Theta(n)$. The Poisson approximation

$$\binom{n}{k} p^k (1 - p)^{n-k} \cong e^{-np}\frac{(np)^k}{k!}$$

is off for central values and tail values even for p = 1/2. The approximation

$$\binom{n}{k} p^k (1 - p)^{n-k} \cong \frac{1}{\sqrt{\pi p n}}\, e^{-\frac{(pn - k)^2}{pn}}$$

is good for p = 1/2 but is off for other values of p.
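As a rough check of the normal approximation for p = 1/2, the sketch below compares the exact binomial probability with a Gaussian density of mean n/2 and variance n/4 evaluated at k. The choice n = 100 and the sample values of k are arbitrary, and the function names are illustrative.

```python
import math

def binomial_pmf_half(n, k):
    """Exact Prob(k successes in n fair trials), via integer arithmetic."""
    return math.comb(n, k) / 2**n

def normal_approx_half(n, k):
    """Normal density with mean n/2 and variance n/4, evaluated at k."""
    return math.exp(-((n / 2 - k) ** 2) / (n / 2)) / math.sqrt(math.pi * n / 2)

n = 100
for k in (50, 55, 60, 70):
    print(k, binomial_pmf_half(n, k), normal_approx_half(n, k))
```

Near the mean the two columns agree to about three decimal places; the relative error grows as k moves into the tail.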

Generation of random numbers according to a given probability distribution


Suppose one wanted to generate a random variable with probability density p(x) where p(x) is continuous. Let P(x) be the cumulative distribution function for x and let u be a random variable with uniform probability density over the interval [0,1]. Then the random variable $x = P^{-1}(u)$ has probability density p(x).

Example: For a Cauchy density function the cumulative distribution function is

$$P(x) = \int_{t=-\infty}^{x} \frac{1}{\pi}\frac{1}{1 + t^2}\,dt = \frac{1}{2} + \frac{1}{\pi}\tan^{-1}(x).$$

Setting u = P(x) and solving for x yields $x = \tan\left(\pi\left(u - \frac{1}{2}\right)\right)$. Thus, to generate a random number x using the Cauchy distribution, generate u, $0 \le u \le 1$, uniformly and calculate $x = \tan\left(\pi\left(u - \frac{1}{2}\right)\right)$. The value of x varies from $-\infty$ to $\infty$ with P(0) = 1/2.
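The inverse-CDF recipe for the Cauchy distribution can be sketched as follows. The seed and sample size are arbitrary choices and the function names are hypothetical. The code draws uniform values, maps them through $\tan(\pi(u - 1/2))$, and compares the empirical fraction of samples at or below 1 with the value P(1) given by the cumulative distribution function.

```python
import math
import random

def cauchy_cdf(x):
    """Cumulative distribution function of the Cauchy density 1 / (pi (1 + t^2))."""
    return 0.5 + math.atan(x) / math.pi

def cauchy_from_uniform(u):
    """Inverse of the Cauchy CDF; maps a uniform u in [0, 1] to a Cauchy-distributed value."""
    return math.tan(math.pi * (u - 0.5))

rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
samples = [cauchy_from_uniform(rng.random()) for _ in range(100_000)]
fraction_below_one = sum(x <= 1.0 for x in samples) / len(samples)
print(fraction_below_one, cauchy_cdf(1.0))
```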


12.4.10 Bayes Rule and Estimators 


Bayes rule 


Bayes rule relates the conditional probability of A given B to the conditional probability of B given A.

$$\text{Prob}(A|B) = \frac{\text{Prob}(B|A)\,\text{Prob}(A)}{\text{Prob}(B)}$$

Suppose one knows the probability of A and wants to know how this probability changes if we know that B has occurred. Prob(A) is called the prior probability. The conditional probability Prob(A|B) is called the posterior probability because it is the probability of A after we know that B has occurred.

The example below illustrates that if a situation is rare, a highly accurate test will often give the wrong answer.

Example: Let A be the event that a product is defective and let B be the event that a test says a product is defective. Let Prob(B|A) be the probability that the test says a product is defective assuming the product is defective and let $\text{Prob}(B|\bar{A})$ be the probability that the test says a product is defective if it is not actually defective.

What is the probability Prob(A|B) that the product is defective if the test says it is defective? Suppose Prob(A) = 0.001, Prob(B|A) = 0.99, and $\text{Prob}(B|\bar{A}) = 0.02$. Then

$$\begin{aligned}
\text{Prob}(B) &= \text{Prob}(B|A)\,\text{Prob}(A) + \text{Prob}(B|\bar{A})\,\text{Prob}(\bar{A}) \\
&= 0.99 \times 0.001 + 0.02 \times 0.999 \\
&= 0.02097
\end{aligned}$$

and

$$\text{Prob}(A|B) = \frac{\text{Prob}(B|A)\,\text{Prob}(A)}{\text{Prob}(B)} = \frac{0.99 \times 0.001}{0.02097} \approx 0.0472.$$

Even though the test fails to detect a defective product only 1% of the time when it is defective and claims that it is defective when it is not only 2% of the time, the test is correct only 4.7% of the time when it says a product is defective. This comes about because of the low frequencies of defective products. ■

The words prior, a posteriori, and likelihood come from Bayes theorem.


$$\text{a posteriori} = \frac{\text{likelihood} \times \text{prior}}{\text{normalizing constant}}$$

$$\text{Prob}(A|B) = \frac{\text{Prob}(B|A)\,\text{Prob}(A)}{\text{Prob}(B)}$$

The a posteriori probability is the conditional probability of A given B. The likelihood is the conditional probability Prob(B|A).
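The defective-product example above can be checked with a few lines of code. This is only a sketch; the function and parameter names are assumptions, with `false_positive_rate` standing for the probability that the test reports a defect when the product is not defective.

```python
def posterior(prior, likelihood, false_positive_rate):
    """Prob(A|B) from Prob(A), Prob(B|A), and Prob(B|not A) via Bayes rule."""
    # Normalizing constant Prob(B), by the law of total probability.
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence

print(posterior(prior=0.001, likelihood=0.99, false_positive_rate=0.02))
```

Raising the prior while keeping the test fixed shows how strongly the posterior depends on how rare the defect is.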





Unbiased Estimators 


Consider n samples $x_1, x_2, \ldots, x_n$ from a Gaussian distribution of mean $\mu$ and variance $\sigma^2$. For this distribution, $m = \frac{x_1 + x_2 + \cdots + x_n}{n}$ is an unbiased estimator of $\mu$, which means that $E(m) = \mu$, and $\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$ is an unbiased estimator of $\sigma^2$. However, if $\mu$ is not known and is approximated by m, then $\frac{1}{n-1}\sum_{i=1}^{n}(x_i - m)^2$ is an unbiased estimator of $\sigma^2$.
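The need for the factor 1/(n-1) when the mean is estimated from the samples can be verified exactly on a small case. The text concerns Gaussian samples; the sketch below instead uses fair 0/1 samples (an assumption made so that every sample can be enumerated), for which the true variance is 1/4. The function name is hypothetical.

```python
from fractions import Fraction
from itertools import product

def expected_estimate(n, divisor):
    """Exact expectation of sum((x_i - m)^2) / divisor over n fair 0/1 samples."""
    total = Fraction(0)
    for sample in product((0, 1), repeat=n):
        m = Fraction(sum(sample), n)  # sample mean
        total += sum((x - m) ** 2 for x in sample) / divisor
    return total / 2**n  # all 2^n samples are equally likely

n = 4
true_variance = Fraction(1, 4)  # variance of a fair 0/1 variable
print(expected_estimate(n, n - 1), expected_estimate(n, n), true_variance)
```

Dividing by n - 1 recovers the true variance on average, while dividing by n underestimates it by the factor (n - 1)/n.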


Maximum Likelihood Estimation (MLE)


Suppose the probability distribution of a random variable x depends on a parameter r. With slight abuse of notation, since r is a parameter rather than a random variable, we denote the probability distribution of x as p(x|r). This is the likelihood of observing x if r was in fact the parameter value. The job of the maximum likelihood estimator, MLE, is to find the best r after observing values of the random variable x. The likelihood of r being the parameter value given that we have observed x is denoted L(r|x). This is again not a probability since r is a parameter, not a random variable. However, if we were to apply Bayes' rule as if this was a conditional probability, we get

$$L(r|x) = \frac{\text{Prob}(x|r)\,\text{Prob}(r)}{\text{Prob}(x)}.$$

Now, assume Prob(r) is the same for all r. The denominator Prob(x) is the absolute probability of observing x and is independent of r. So to maximize L(r|x), we just maximize Prob(x|r). In some situations, one has a prior guess as to the distribution Prob(r). This is then called the "prior" and in that case, we call Prob(r|x) the posterior, which we try to maximize.


Example: Consider flipping a coin 100 times. Suppose 62 heads and 38 tails occur. What is the maximum likelihood value of the probability of the coin to come down heads when the coin is flipped? In this case, it is r = 0.62. The probability that we get 62 heads if the unknown probability of heads in one trial is r is

$$\text{Prob}(62 \text{ heads} \,|\, r) = \binom{100}{62} r^{62}(1 - r)^{38}.$$

This quantity is maximized when r = 0.62. To see this take the derivative with respect to r of $r^{62}(1 - r)^{38}$. The derivative is zero at r = 0.62 and the second derivative is negative indicating a maximum. Thus, r = 0.62 is the maximum likelihood estimator of the probability of heads in a trial. ■
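The coin example can also be confirmed by a direct search over candidate values of r. The sketch below maximizes the logarithm of $r^{62}(1 - r)^{38}$ on a grid with step 0.01; working with the logarithm and using a grid are choices of this illustration, not part of the derivation in the text.

```python
import math

def log_likelihood(r, heads=62, tails=38):
    """Log of r^heads * (1 - r)^tails; the binomial coefficient does not depend on r."""
    return heads * math.log(r) + tails * math.log(1.0 - r)

# Search r = 0.01, 0.02, ..., 0.99 and keep the index of the best value.
best_percent = max(range(1, 100), key=lambda i: log_likelihood(i / 100))
print(best_percent / 100)
```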


12.5 Bounds on Tail Probability 
12.5.1 Chernoff Bounds 


Markov's inequality bounds the probability that a non-negative random variable exceeds a value a.

$$p(x \ge a) \le \frac{E(x)}{a}$$

or

$$p(x \ge aE(x)) \le \frac{1}{a}.$$

If one also knows the variance, $\sigma^2$, then using Chebyshev's inequality one can bound the probability that a random variable differs from its expected value by more than a standard deviations. Let m = E(x). Then Chebyshev's inequality states that

$$p(|x - m| \ge a\sigma) \le \frac{1}{a^2}.$$

If a random variable s is the sum of n independent random variables $x_1, x_2, \ldots, x_n$ of finite variance, then better bounds are possible. Here we focus on the case where the n independent variables are binomial. In the next section we consider the more general case where we have independent random variables from any distribution that has a finite variance.


Let $x_1, x_2, \ldots, x_n$ be independent random variables where

$$x_i = \begin{cases} 0 & \text{Prob } 1 - p \\ 1 & \text{Prob } p. \end{cases}$$

Consider the sum $s = \sum_{i=1}^{n} x_i$. Here the expected value of each $x_i$ is p and by linearity of expectation, the expected value of the sum is m = np. Chernoff bounds bound the probability that the sum s exceeds $(1 + \delta)m$ or is less than $(1 - \delta)m$. We state these bounds as Theorems 12.3 and 12.4 below and give their proofs.


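Theorem 12.3 For any $\delta > 0$, $\text{Prob}\left(s > (1 + \delta)m\right) < \left(\frac{e^{\delta}}{(1 + \delta)^{(1 + \delta)}}\right)^m$.


Theorem 12.4 Let $0 < \gamma \le 1$, then $\text{Prob}\left(s < (1 - \gamma)m\right) < \left(\frac{e^{-\gamma}}{(1 - \gamma)^{(1 - \gamma)}}\right)^m < e^{-\frac{\gamma^2 m}{2}}$.


Proof (Theorem 12.3): For any $\lambda > 0$, the function $e^{\lambda x}$ is monotone. Thus,

$$\text{Prob}\left(s > (1 + \delta)m\right) = \text{Prob}\left(e^{\lambda s} > e^{\lambda(1 + \delta)m}\right).$$

$e^{\lambda x}$ is non-negative for all x, so we can apply Markov's inequality to get

$$\text{Prob}\left(e^{\lambda s} > e^{\lambda(1 + \delta)m}\right) \le e^{-\lambda(1 + \delta)m} E\left(e^{\lambda s}\right).$$

Since the $x_i$ are independent,

$$E\left(e^{\lambda s}\right) = E\left(e^{\lambda \sum_{i=1}^{n} x_i}\right) = E\left(\prod_{i=1}^{n} e^{\lambda x_i}\right) = \prod_{i=1}^{n} E\left(e^{\lambda x_i}\right) = \prod_{i=1}^{n}\left(e^{\lambda}p + 1 - p\right) = \prod_{i=1}^{n}\left(p(e^{\lambda} - 1) + 1\right).$$

Using the inequality $1 + x < e^x$ with $x = p(e^{\lambda} - 1)$ yields

$$E\left(e^{\lambda s}\right) < \prod_{i=1}^{n} e^{p(e^{\lambda} - 1)}.$$

Thus, for all $\lambda > 0$

$$\begin{aligned}
\text{Prob}\left(s > (1 + \delta)m\right) &\le \text{Prob}\left(e^{\lambda s} > e^{\lambda(1 + \delta)m}\right) \\
&\le e^{-\lambda(1 + \delta)m} E\left(e^{\lambda s}\right) \\
&\le e^{-\lambda(1 + \delta)m} \prod_{i=1}^{n} e^{p(e^{\lambda} - 1)}.
\end{aligned}$$

Setting $\lambda = \ln(1 + \delta)$

$$\begin{aligned}
\text{Prob}\left(s > (1 + \delta)m\right) &\le \left(e^{-\ln(1 + \delta)}\right)^{(1 + \delta)m} \prod_{i=1}^{n} e^{p(e^{\ln(1 + \delta)} - 1)} \\
&\le \left(\frac{1}{1 + \delta}\right)^{(1 + \delta)m} \prod_{i=1}^{n} e^{p\delta} \\
&\le \left(\frac{1}{1 + \delta}\right)^{(1 + \delta)m} e^{np\delta} \\
&\le \left(\frac{e^{\delta}}{(1 + \delta)^{(1 + \delta)}}\right)^m.
\end{aligned}$$

■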


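To simplify the bound of Theorem 12.3, observe that

$$(1 + \delta)\ln(1 + \delta) = \delta + \frac{\delta^2}{2} - \frac{\delta^3}{6} + \frac{\delta^4}{12} - \cdots.$$

Therefore

$$(1 + \delta)^{(1 + \delta)} = e^{\delta + \frac{\delta^2}{2} - \frac{\delta^3}{6} + \frac{\delta^4}{12} - \cdots}$$

and hence

$$\frac{e^{\delta}}{(1 + \delta)^{(1 + \delta)}} = e^{-\frac{\delta^2}{2} + \frac{\delta^3}{6} - \cdots}.$$

Thus, the bound simplifies to

$$\text{Prob}\left(s > (1 + \delta)m\right) \le e^{-\frac{\delta^2}{2}m + \frac{\delta^3}{6}m - \cdots}.$$

For small $\delta$ the probability drops exponentially with $\delta^2$.


When $\delta$ is large another simplification is possible. First

$$\text{Prob}\left(s > (1 + \delta)m\right) \le \left(\frac{e^{\delta}}{(1 + \delta)^{(1 + \delta)}}\right)^m \le \left(\frac{e}{1 + \delta}\right)^{(1 + \delta)m}.$$

If $\delta > 2e - 1$, substituting $2e - 1$ for $\delta$ in the denominator yields

$$\text{Prob}\left(s > (1 + \delta)m\right) \le 2^{-(1 + \delta)m}.$$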

Theorem 12.3 gives a bound on the probability of the sum being significantly greater 
than the mean. We now prove Theorem 12.4, bounding the probability that the sum will 
be significantly less than its mean. 


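Proof (Theorem 12.4): For any $\lambda > 0$

$$\text{Prob}\left(s < (1 - \gamma)m\right) = \text{Prob}\left(-s > -(1 - \gamma)m\right) = \text{Prob}\left(e^{-\lambda s} > e^{-\lambda(1 - \gamma)m}\right).$$

Applying Markov's inequality

$$\text{Prob}\left(s < (1 - \gamma)m\right) < \frac{E\left(e^{-\lambda s}\right)}{e^{-\lambda(1 - \gamma)m}} < \frac{\prod_{i=1}^{n} E\left(e^{-\lambda x_i}\right)}{e^{-\lambda(1 - \gamma)m}}.$$

Now

$$E\left(e^{-\lambda x_i}\right) = pe^{-\lambda} + 1 - p = 1 + p\left(e^{-\lambda} - 1\right).$$

Thus,

$$\text{Prob}\left(s < (1 - \gamma)m\right) < \frac{\prod_{i=1}^{n}\left[1 + p\left(e^{-\lambda} - 1\right)\right]}{e^{-\lambda(1 - \gamma)m}}.$$

Since $1 + x < e^x$

$$\text{Prob}\left(s < (1 - \gamma)m\right) < \frac{e^{np(e^{-\lambda} - 1)}}{e^{-\lambda(1 - \gamma)m}}.$$

Setting $\lambda = \ln\frac{1}{1 - \gamma}$

$$\text{Prob}\left(s < (1 - \gamma)m\right) < \frac{e^{np(1 - \gamma - 1)}}{(1 - \gamma)^{(1 - \gamma)m}} < \left(\frac{e^{-\gamma}}{(1 - \gamma)^{(1 - \gamma)}}\right)^m.$$

But for $0 < \gamma \le 1$, $(1 - \gamma)^{(1 - \gamma)} > e^{-\gamma + \frac{\gamma^2}{2}}$. To see this note that

$$\begin{aligned}
(1 - \gamma)\ln(1 - \gamma) &= (1 - \gamma)\left(-\gamma - \frac{\gamma^2}{2} - \frac{\gamma^3}{3} - \cdots\right) \\
&= -\gamma - \frac{\gamma^2}{2} - \frac{\gamma^3}{3} - \cdots + \gamma^2 + \frac{\gamma^3}{2} + \frac{\gamma^4}{3} + \cdots \\
&= -\gamma + \left(\gamma^2 - \frac{\gamma^2}{2}\right) + \left(\frac{\gamma^3}{2} - \frac{\gamma^3}{3}\right) + \cdots \\
&= -\gamma + \frac{\gamma^2}{2} + \frac{\gamma^3}{6} + \cdots \\
&\ge -\gamma + \frac{\gamma^2}{2}.
\end{aligned}$$

It then follows that

$$\text{Prob}\left(s < (1 - \gamma)m\right) < \left(\frac{e^{-\gamma}}{(1 - \gamma)^{(1 - \gamma)}}\right)^m < e^{-\frac{m\gamma^2}{2}}.$$

■


12.5.2 More General Tail Bounds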


The main purpose of this section is to state the Master Tail bounds theorem of Chapter 
2 (with more detail), give a proof of it, and derive the other tail inequalities mentioned 
in the table in that chapter. Recall that Markov’s inequality bounds the tail probability 
of a non-negative random variable x based only on its expectation. For a > 0, 


$$\Pr(x > a) \le \frac{E(x)}{a}.$$

As a grows, the bound drops off as 1/a. Given the second moment of x, recall that Chebyshev's inequality, which does not assume x is a non-negative random variable, gives a tail bound falling off as $1/a^2$:

$$\Pr\left(|x - E(x)| \ge a\right) \le \frac{E\left((x - E(x))^2\right)}{a^2}.$$

Higher moments yield bounds by applying either of these two theorems. For example, if r is a non-negative even integer, then $x^r$ is a non-negative random variable even if x takes on negative values. Applying Markov's inequality to $x^r$,

$$\Pr(|x| \ge a) = \Pr(x^r \ge a^r) \le \frac{E(x^r)}{a^r},$$

a bound that falls off as $1/a^r$. The larger the r, the greater the rate of fall, but a bound on $E(x^r)$ is needed to apply this technique.


For a random variable x that is the sum of a large number of independent random variables, $x_1, x_2, \ldots, x_n$, one can derive bounds on $E(x^r)$ for high even r. There are many situations where the sum of a large number of independent random variables arises. For example, $x_i$ may be the amount of a good that the $i^{th}$ consumer buys, the length of the $i^{th}$ message sent over a network, or the indicator random variable of whether the $i^{th}$ record in a large database has a certain property. Each $x_i$ is modeled by a simple probability distribution. Gaussian, exponential (probability density $e^{-t}$ at any t > 0), and binomial distributions are typically used, respectively, in the three examples here. If the $x_i$ have 0-1 distributions, then the Chernoff bounds described in Section 12.5.1 can be used to bound the tails of $x = x_1 + x_2 + \cdots + x_n$. But exponential and Gaussian random variables are not bounded so the proof technique used in Section 12.5.1 does not apply. However, good bounds on the moments of these two distributions are known. Indeed, for any integer s > 0, the $s^{th}$ moment for the unit variance Gaussian and the exponential are both at most s!.


Given bounds on the moments of individual x; the following theorem proves moment 
bounds on their sum. We use this theorem to derive tail bounds not only for sums of 0-1 
random variables, but also Gaussians, exponentials, Poisson, etc. 


The gold standard for tail bounds is the central limit theorem for independent, identically distributed random variables $x_1, x_2, \ldots, x_n$ with zero mean and $\text{Var}(x_i) = \sigma^2$ that states as $n \to \infty$ the distribution of $x = (x_1 + x_2 + \cdots + x_n)/\sqrt{n}$ tends to the Gaussian density with zero mean and variance $\sigma^2$. Loosely, this says that in the limit, the tails of $x = (x_1 + x_2 + \cdots + x_n)/\sqrt{n}$ are bounded by that of a Gaussian with variance $\sigma^2$. But this theorem is only in the limit, whereas we prove a bound that applies for all n.


In the following theorem, x is the sum of n independent, not necessarily identically distributed, random variables $x_1, x_2, \ldots, x_n$, each of zero mean and variance at most $\sigma^2$. By the central limit theorem, in the limit the probability density of x goes to that of the Gaussian with variance at most $n\sigma^2$. In a limit sense, this implies an upper bound of $ce^{-a^2/(2n\sigma^2)}$ for the tail probability $\Pr(|x| > a)$ for some constant c. The following theorem assumes bounds on higher moments, and asserts a quantitative upper bound of $3e^{-a^2/(12n\sigma^2)}$ on the tail probability, not just in the limit, but for every n. We will apply this theorem to get tail bounds on sums of Gaussian, binomial, and power law distributed random variables.


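Theorem 12.5 Let $x = x_1 + x_2 + \cdots + x_n$, where $x_1, x_2, \ldots, x_n$ are mutually independent random variables with zero mean and variance at most $\sigma^2$. Suppose $a \in [0, \sqrt{2}\,n\sigma^2]$ and $s \le n\sigma^2/2$ is a positive even integer and $|E(x_i^r)| \le \sigma^2 r!$, for $r = 3, 4, \ldots, s$. Then,

$$\Pr\left(|x_1 + x_2 + \cdots + x_n| \ge a\right) \le \left(\frac{2sn\sigma^2}{a^2}\right)^{s/2}.$$

If further, $s \ge a^2/(4n\sigma^2)$, then we also have:

$$\Pr\left(|x_1 + x_2 + \cdots + x_n| \ge a\right) \le 3e^{-a^2/(12n\sigma^2)}.$$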


Proof: We first prove an upper bound on $E(x^r)$ for any even positive integer r and then use Markov's inequality as discussed earlier. Expand $(x_1 + x_2 + \cdots + x_n)^r$.

$$(x_1 + x_2 + \cdots + x_n)^r = \sum \binom{r}{r_1, r_2, \ldots, r_n} x_1^{r_1} x_2^{r_2} \cdots x_n^{r_n} = \sum \frac{r!}{r_1!\, r_2! \cdots r_n!}\, x_1^{r_1} x_2^{r_2} \cdots x_n^{r_n}$$

where the $r_i$ range over all non-negative integers summing to r. By independence

$$E(x^r) = \sum \frac{r!}{r_1!\, r_2! \cdots r_n!}\, E(x_1^{r_1}) E(x_2^{r_2}) \cdots E(x_n^{r_n}).$$

If in a term, any $r_i = 1$, the term is zero since $E(x_i) = 0$. Assume henceforth that $(r_1, r_2, \ldots, r_n)$ runs over sets of non-zero $r_i$ summing to r where each non-zero $r_i$ is at least two. There are at most r/2 non-zero $r_i$ in each set. Since $|E(x_i^{r_i})| \le \sigma^2 r_i!$,

$$E(x^r) \le r! \sum_{(r_1, r_2, \ldots, r_n)} \sigma^{2(\text{number of non-zero } r_i \text{ in set})}.$$

Collect terms of the summation with t non-zero $r_i$ for $t = 1, 2, \ldots, r/2$. There are $\binom{n}{t}$ subsets of $\{1, 2, \ldots, n\}$ of cardinality t. Once a subset is fixed as the set of t values of i with non-zero $r_i$, set each of the $r_i \ge 2$. That is, allocate two to each of the $r_i$ and then allocate the remaining r - 2t to the t $r_i$ arbitrarily. The number of such allocations is just $\binom{r - 2t + t - 1}{t - 1} = \binom{r - t - 1}{t - 1}$. So,

$$E(x^r) \le r! \sum_{t=1}^{r/2} f(t), \quad \text{where } f(t) = \binom{n}{t}\binom{r - t - 1}{t - 1}\sigma^{2t}.$$

Thus $f(t) \le h(t)$, where $h(t) = \frac{(n\sigma^2)^t}{t!}\, 2^{r - t - 1}$. Since $t \le r/2 \le n\sigma^2/4$, we have

$$\frac{h(t)}{h(t - 1)} = \frac{n\sigma^2}{2t} \ge 2.$$

So, we get

$$E(x^r) = r! \sum_{t=1}^{r/2} f(t) \le r!\, h(r/2)\left(1 + \frac{1}{2} + \frac{1}{4} + \cdots\right) \le \frac{r!}{(r/2)!}\, 2^{r/2} (n\sigma^2)^{r/2}.$$

Applying Markov inequality,

$$\Pr(|x| > a) = \Pr(|x|^r > a^r) \le \frac{r!\,(n\sigma^2)^{r/2}\, 2^{r/2}}{(r/2)!\, a^r} = g(r) \le \left(\frac{2rn\sigma^2}{a^2}\right)^{r/2}.$$

This holds for all $r \le s$, r even and applying it with r = s, we get the first inequality of the theorem.

We now prove the second inequality. For even r, $g(r)/g(r - 2) = \frac{4(r - 1)n\sigma^2}{a^2}$ and so g(r) decreases as long as $r - 1 \le a^2/(4n\sigma^2)$. Taking r to be the largest even integer less than or equal to $a^2/(6n\sigma^2)$, the tail probability is at most $e^{-r/2}$, which is at most $e \cdot e^{-a^2/(12n\sigma^2)} \le 3 \cdot e^{-a^2/(12n\sigma^2)}$, proving the theorem. ■





12.6 Applications of the Tail Bound 
Chernoff Bounds 


Chernoff bounds deal with sums of Bernoulli random variables. Here we apply Theorem 12.5 to derive these.


Theorem 12.6 Suppose $y_1, y_2, \ldots, y_n$ are independent 0-1 random variables with $E(y_i) = p$ for all i. Let $y = y_1 + y_2 + \cdots + y_n$. Then for any $c \in [0, 1]$,

$$\text{Prob}\left(|y - E(y)| \ge cnp\right) \le 3e^{-npc^2/12}.$$

Proof: Let $x_i = y_i - p$. Then, $E(x_i) = 0$ and $E(x_i^2) = E(y_i - p)^2 = p(1 - p) \le p$. For $s \ge 3$,

$$\begin{aligned}
|E(x_i^s)| &= |E(y_i - p)^s| \\
&= |p(1 - p)^s + (1 - p)(0 - p)^s| \\
&= \left|p(1 - p)\left((1 - p)^{s-1} + (-p)^{s-1}\right)\right| \\
&\le p.
\end{aligned}$$

Apply Theorem 12.5 with $\sigma^2 = p$ and a = cnp. Noting that $a < \sqrt{2}\,np$ completes the proof. ■

Section 12.5.1 contains a different proof that uses a standard method based on moment-generating functions and gives a better constant in the exponent.


Power Law Distributions 


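The power law distribution of order k where k is a positive integer is

$$f(x) = \frac{k - 1}{x^k} \quad \text{for } x \ge 1.$$

If a random variable x has this distribution for $k \ge 4$, then

$$\mu = E(x) = \frac{k - 1}{k - 2} \quad \text{and} \quad \text{Var}(x) = \frac{k - 1}{(k - 2)^2 (k - 3)}.$$

Theorem 12.7 Suppose $x_1, x_2, \ldots, x_n$ are i.i.d., each distributed according to the Power Law of order $k \ge 4$ (with $n > 10k^2$). Then, for $x = x_1 + x_2 + \cdots + x_n$, and any $\varepsilon \in \left(1/(2\sqrt{nk}),\, 1/k^2\right)$, we have

$$\Pr\left(|x - E(x)| \ge \varepsilon E(x)\right) \le \left(\frac{4}{\varepsilon^2 (k - 1) n}\right)^{(k - 3)/2}.$$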

Proof: For integer s, the $s^{th}$ moment of $x_i - E(x_i)$, namely, $E((x_i - \mu)^s)$, exists if and only if $s \le k - 2$. For $s \le k - 2$,

$$E((x_i - \mu)^s) = (k - 1)\int_1^{\infty} \frac{(y - \mu)^s}{y^k}\,dy.$$

Using the substitution of variable $z = \mu/y$

$$\frac{(y - \mu)^s}{y^k} = y^{s - k}(1 - z)^s = \frac{z^{k - s}}{\mu^{k - s}}(1 - z)^s.$$

As y goes from 1 to $\infty$, z goes from $\mu$ to 0, and $dz = -\frac{\mu}{y^2}\,dy$. Thus

$$\begin{aligned}
E((x_i - \mu)^s) &= (k - 1)\int_1^{\infty} \frac{(y - \mu)^s}{y^k}\,dy \\
&= \frac{k - 1}{\mu^{k - s - 1}}\int_0^1 (1 - z)^s z^{k - s - 2}\,dz + \frac{k - 1}{\mu^{k - s - 1}}\int_1^{\mu} (1 - z)^s z^{k - s - 2}\,dz.
\end{aligned}$$

The first integral is just the standard integral of the beta function and its value is $\frac{s!\,(k - 2 - s)!}{(k - 1)!}$. To bound the second integral, note that for $z \in [1, \mu]$, $|z - 1| \le \frac{1}{k - 2}$ and

$$z^{k - s - 2} \le \left(1 + \frac{1}{k - 2}\right)^{k - s - 2} \le e^{(k - s - 2)/(k - 2)} \le e.$$

So,

$$|E((x_i - \mu)^s)| \le \frac{(k - 1)\,s!\,(k - 2 - s)!}{(k - 1)!} + \frac{e(k - 1)}{(k - 2)^{s + 1}} \le s!\,\text{Var}(x).$$

Now, apply the first inequality of Theorem 12.5 with s of that theorem set to k - 2 or k - 3 whichever is even. Note that $a = \varepsilon E(x) \le \sqrt{2}\,n\sigma^2$ (since $\varepsilon \le 1/k^2$). The present theorem follows by a calculation. ■


12.7 Eigenvalues and Eigenvectors


Let A be an $n \times n$ real matrix. The scalar $\lambda$ is called an eigenvalue of A if there exists a non-zero vector x satisfying the equation $Ax = \lambda x$. The vector x is called the eigenvector of A associated with $\lambda$. The set of all eigenvectors associated with a given eigenvalue, together with the zero vector, forms a subspace as seen from the fact that if $Ax = \lambda x$ and $Ay = \lambda y$, then for any scalars c and d, $A(cx + dy) = \lambda(cx + dy)$. The equation $Ax = \lambda x$ has a non-trivial solution only if $\det(A - \lambda I) = 0$. The equation $\det(A - \lambda I) = 0$ is called the characteristic equation and has n not necessarily distinct roots.
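For a 2 × 2 matrix the characteristic equation is a quadratic, so its roots can be written down directly. The sketch below is an illustration restricted to the 2 × 2 case, with a hypothetical function name; it returns both roots as complex numbers so that matrices with non-real eigenvalues are also handled.

```python
import cmath

def eigenvalues_2x2(a, b, c, d):
    """Roots of det(A - lambda I) = lambda^2 - (a + d) lambda + (a d - b c) for A = [[a, b], [c, d]]."""
    trace, det = a + d, a * d - b * c
    root = cmath.sqrt(trace * trace - 4 * det)
    return (trace + root) / 2, (trace - root) / 2

print(eigenvalues_2x2(2, 1, 1, 2))   # a symmetric matrix: both roots are real
print(eigenvalues_2x2(0, -1, 1, 0))  # rotation by 90 degrees: roots are not real
```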


Matrices A and B are similar if there is an invertible matrix P such that $A = P^{-1}BP$.

Theorem 12.8 If A and B are similar, then they have the same eigenvalues.


Proof: Let A and B be similar matrices. Then there exists an invertible matrix P such that $A = P^{-1}BP$. For an eigenvector x of A with eigenvalue $\lambda$, $Ax = \lambda x$, which implies $P^{-1}BPx = \lambda x$ or $B(Px) = \lambda(Px)$. So, Px is an eigenvector of B with the same eigenvalue $\lambda$. Since the reverse also holds, the theorem follows. ■


Even though two similar matrices, A and B, have the same eigenvalues, their eigen- 
vectors are in general different. 


The matrix A is diagonalizable if A is similar to a diagonal matrix. 


Theorem 12.9 A is diagonalizable if and only if A has n linearly independent eigenvec- 
tors. 


Proof:


(only if) Assume A is diagonalizable. Then there exists an invertible matrix P and a diagonal matrix D such that $D = P^{-1}AP$. Thus, PD = AP. Let the diagonal elements of D be $\lambda_1, \lambda_2, \ldots, \lambda_n$ and let $p_1, p_2, \ldots, p_n$ be the columns of P. Then $AP = [Ap_1, Ap_2, \ldots, Ap_n]$ and $PD = [\lambda_1 p_1, \lambda_2 p_2, \ldots, \lambda_n p_n]$. Hence $Ap_i = \lambda_i p_i$. That is, the $\lambda_i$ are the eigenvalues of A and the $p_i$ are the corresponding eigenvectors. Since P is invertible, the $p_i$ are linearly independent.


(if) Assume that A has n linearly independent eigenvectors $p_1, p_2, \ldots, p_n$ with corresponding eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$. Then $Ap_i = \lambda_i p_i$ and reversing the above steps

$$AP = [Ap_1, Ap_2, \ldots, Ap_n] = [\lambda_1 p_1, \lambda_2 p_2, \ldots, \lambda_n p_n] = PD.$$

Thus, AP = PD. Since the $p_i$ are linearly independent, P is invertible and hence $A = PDP^{-1}$. Thus, A is diagonalizable. ■

It follows from the proof of the theorem that if A is diagonalizable and has eigenvalue $\lambda$ with multiplicity k, then there are k linearly independent eigenvectors associated with $\lambda$.


A matrix P is orthogonal if it is invertible and $P^{-1} = P^T$. A matrix A is orthogonally diagonalizable if there exists an orthogonal matrix P such that $P^{-1}AP = D$ is diagonal. If A is orthogonally diagonalizable, then $A = PDP^T$ and AP = PD. Thus, the columns of P are the eigenvectors of A and the diagonal elements of D are the corresponding eigenvalues.


If P is an orthogonal matrix, then $P^TAP$ and A are both representations of the same linear transformation with respect to different bases. To see this, note that if $e_1, e_2, \ldots, e_n$ is the standard basis, then $a_{ij}$ is the component of $Ae_j$ along the direction $e_i$, namely, $a_{ij} = e_i^T A e_j$. Thus, A defines a linear transformation by specifying the image under the transformation of each basis vector. Denote by $p_j$ the $j^{th}$ column of P. It is easy to see that $(P^TAP)_{ij}$ is the component of $Ap_j$ along the direction $p_i$, namely, $(P^TAP)_{ij} = p_i^T A p_j$. Since P is orthogonal, the $p_j$ form a basis of the space and so $P^TAP$ represents the same linear transformation as A, but in the basis $p_1, p_2, \ldots, p_n$.


Another remark is in order. Check that

$$A = PDP^T = \sum_{i=1}^{n} d_{ii}\, p_i p_i^T.$$

Compare this with the singular value decomposition where

$$A = \sum_{i=1}^{n} \sigma_i u_i v_i^T,$$

the only difference being that $u_i$ and $v_i$ can be different and indeed if A is not square, they will certainly be.


12.7.1 Symmetric Matrices 


For an arbitrary matrix, some of the eigenvalues may be complex. However, for a 
symmetric matrix with real entries, all eigenvalues are real. The number of eigenvalues 
of a symmetric matrix, counting multiplicities, equals the dimension of the matrix. The 
set of eigenvectors associated with a given eigenvalue form a vector space. For a non- 
symmetric matrix, the dimension of this space may be less than the multiplicity of the 
eigenvalue. Thus, a non-symmetric matrix may not be diagonalizable. However, for a 
symmetric matrix the eigenvectors associated with a given eigenvalue form a vector space 
of dimension equal to the multiplicity of the eigenvalue. Thus, all symmetric matrices are 
diagonalizable. The above facts for symmetric matrices are summarized in the following 
theorem. 


Theorem 12.10 (Real Spectral Theorem) Let A be a real symmetric matrix. Then

1. The eigenvalues, $\lambda_1, \lambda_2, \ldots, \lambda_n$, are real, as are the components of the corresponding eigenvectors, $v_1, v_2, \ldots, v_n$.


2. (Spectral Decomposition) A is orthogonally diagonalizable and indeed

$$A = VDV^T = \sum_{i=1}^{n} \lambda_i v_i v_i^T,$$

where V is the matrix with columns $v_1, v_2, \ldots, v_n$, $|v_i| = 1$ and D is a diagonal matrix with entries $\lambda_1, \lambda_2, \ldots, \lambda_n$.


Proof: $Av_i = \lambda_i v_i$ and $v_i^c A v_i = \lambda_i v_i^c v_i$. Here the c superscript means conjugate transpose. Normalizing so that $v_i^c v_i = 1$,

$$\lambda_i = v_i^c A v_i = (v_i^c A v_i)^{cc} = (v_i^c A^c v_i)^c = (v_i^c A v_i)^c = \lambda_i^c$$

and hence $\lambda_i$ is real.

Since $\lambda_i$ is real, a non-trivial solution to $(A - \lambda_i I)x = 0$ has real components.


Let P be a real orthogonal matrix, that is $P^{-1} = P^T$, such that $Pv_1 = e_1$ where $e_1 = (1, 0, 0, \ldots, 0)^T$. We will construct such a P shortly. Since $Av_1 = \lambda_1 v_1$,

$$PAP^T e_1 = PAv_1 = \lambda_1 Pv_1 = \lambda_1 e_1.$$

The condition $PAP^T e_1 = \lambda_1 e_1$ plus symmetry implies that $PAP^T = \begin{pmatrix} \lambda_1 & 0 \\ 0 & A' \end{pmatrix}$ where $A'$ is n - 1 by n - 1 and symmetric. By induction, $A'$ is orthogonally diagonalizable. Let Q be the orthogonal matrix with $QA'Q^T = D'$, a diagonal matrix. Q is $(n - 1) \times (n - 1)$. Augment Q to an $n \times n$ matrix by putting 1 in the (1,1) position and 0 elsewhere in the first row and column. Call the resulting matrix R. R is orthogonal too.

$$R\begin{pmatrix} \lambda_1 & 0 \\ 0 & A' \end{pmatrix}R^T = \begin{pmatrix} \lambda_1 & 0 \\ 0 & D' \end{pmatrix} \implies RPAP^TR^T = \begin{pmatrix} \lambda_1 & 0 \\ 0 & D' \end{pmatrix}.$$

Since the product of two orthogonal matrices is orthogonal, this finishes the proof of (2) except it remains to construct P. For this, take an orthonormal basis of space containing $v_1$. Suppose the basis is $\{v_1, w_2, w_3, \ldots\}$ and V is the matrix with these basis vectors as its columns. Then $P = V^T$ will do. ■


Theorem 12.11 (The fundamental theorem of symmetric matrices) A real matrix A is orthogonally diagonalizable if and only if A is symmetric.


Proof: (only if) Assume A is orthogonally diagonalizable. Then there exists an orthogonal matrix P such that $D = P^{-1}AP$. Since $P^{-1} = P^T$, we get

$$A = PDP^{-1} = PDP^T$$

which implies

$$A^T = (PDP^T)^T = PDP^T = A$$

and hence A is symmetric.

(if) Already proved in Theorem 12.10. ■


Note that a non-symmetric matrix may not be diagonalizable, it may have eigenvalues that are not real, and the number of linearly independent eigenvectors corresponding to an eigenvalue may be less than its multiplicity. For example, the matrix
$$\begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \end{pmatrix}$$
has eigenvalues $2$, $\frac{1}{2} + i\frac{\sqrt{3}}{2}$, and $\frac{1}{2} - i\frac{\sqrt{3}}{2}$. The matrix $\begin{pmatrix} 1 & 2 \\ 0 & 1 \end{pmatrix}$ has characteristic equation $(1 - \lambda)^2 = 0$ and thus has eigenvalue 1 with multiplicity 2 but has only one linearly independent eigenvector associated with the eigenvalue 1, namely $x = c \begin{pmatrix} 1 \\ 0 \end{pmatrix}$, $c \ne 0$.

Neither of these situations is possible for a symmetric matrix.
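The spectral decomposition of Theorem 12.10 can be checked numerically on a small case. The following Python sketch is an illustration only: the helper names `eig_sym_2x2` and `rebuild` are hypothetical, and the closed form it uses applies to 2 x 2 symmetric matrices rather than to the general case handled by the proof.

```python
import math

def eig_sym_2x2(a, b, d):
    """Eigenpairs of the symmetric matrix [[a, b], [b, d]], largest eigenvalue first."""
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    # Angle of the unit vector that maximizes x^T A x (the top eigenvector).
    theta = 0.5 * math.atan2(2.0 * b, a - d)
    v1 = (math.cos(theta), math.sin(theta))
    v2 = (-math.sin(theta), math.cos(theta))  # unit vector orthogonal to v1
    return [(mean + radius, v1), (mean - radius, v2)]

def rebuild(pairs):
    """Return sum_i lambda_i * v_i v_i^T as a 2 x 2 nested list."""
    m = [[0.0, 0.0], [0.0, 0.0]]
    for lam, v in pairs:
        for i in range(2):
            for j in range(2):
                m[i][j] += lam * v[i] * v[j]
    return m

pairs = eig_sym_2x2(2.0, 1.0, 2.0)
print([lam for lam, _ in pairs])  # eigenvalues of [[2, 1], [1, 2]]
print(rebuild(pairs))             # should reproduce [[2, 1], [1, 2]] up to rounding
```

Both eigenvalues are real and the rank-one terms add back to the original matrix, as the theorem states.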


12.7.2 Relationship between SVD and Eigen Decomposition 


The singular value decomposition exists for any n x d matrix whereas the eigenvalue 
decomposition exists only for certain square matrices. For symmetric matrices the de- 
compositions are essentially the same. 


The singular values of a matrix are always non-negative since each singular value is the length of the vector of projections of the rows to the corresponding singular vector. Given a symmetric matrix, the eigenvalues can be positive or negative. If $A$ is a symmetric matrix with eigenvalue decomposition $A = V_E D_E V_E^T$ and singular value decomposition $A = U_S D_S V_S^T$, what is the relationship between $D_E$ and $D_S$, and between $V_E$ and $V_S$, and between $U_S$ and $V_E$? Observe that if $A$ can be expressed as $QDQ^T$ where $Q$ is orthonormal and $D$ is diagonal, then $AQ = QD$. That is, each column of $Q$ is an eigenvector and the elements of $D$ are the eigenvalues. Thus, if the eigenvalues of $A$ are distinct, then $Q$ is unique up to a permutation of columns and the signs of the columns. If an eigenvalue has multiplicity $k$, then the space spanned by the corresponding $k$ columns is unique. In the following we will use the term essentially unique to capture this situation. Now $AA^T = U_S D_S^2 U_S^T$ and $A^TA = V_S D_S^2 V_S^T$. By an argument similar to the one above, $U_S$ and $V_S$ are essentially unique and are the eigenvectors or negatives of the eigenvectors of $A$ and $A^T$. The eigenvalues of $AA^T$ or $A^TA$ are the squares of the eigenvalues of $A$. If $A$ is not positive semidefinite and has negative eigenvalues, then in the singular value decomposition $A = U_S D_S V_S^T$, some of the left singular vectors are the negatives of the eigenvectors. Let $S$ be a diagonal matrix with $\pm 1$'s on the diagonal depending on whether the corresponding eigenvalue is positive or negative. Then $A = (U_S S)(S D_S) V_S^T$ where $U_S S = V_E$ and $S D_S = D_E$.


12.7.3 Extremal Properties of Eigenvalues


In this section we derive a min max characterization of eigenvalues that implies that the largest eigenvalue of a symmetric matrix $A$ has a value equal to the maximum of $x^TAx$ over all vectors $x$ of unit length. When no eigenvalue of $A$ exceeds the largest one in absolute value, for example when $A$ is positive semidefinite, this maximum also equals the 2-norm of $A$. If $A$ is a real symmetric matrix there exists an orthogonal matrix $P$ that diagonalizes $A$. Thus
$$P^TAP = D$$
where $D$ is a diagonal matrix with the eigenvalues of $A$, $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$, on its diagonal. Rather than working with $A$, it is easier to work with the diagonal matrix $D$. This will be an important technique that will simplify many proofs.





Consider maximizing $x^TAx$ subject to the conditions

1. $\sum_{i=1}^{n} x_i^2 = 1$

2. $r_i^T x = 0, \quad 1 \le i \le s$

where the $r_i$ are any set of non-zero vectors. We ask over all possible sets $\{r_i \mid 1 \le i \le s\}$ of $s$ vectors, what is the minimum value assumed by this maximum.


Theorem 12.12 (Min max theorem) For a symmetric matrix $A$,
$$\min_{r_1,\ldots,r_s}\ \max_{x \perp r_1,\ldots,r_s} (x^TAx) = \lambda_{s+1}$$
where the minimum is over all sets $\{r_1, r_2, \ldots, r_s\}$ of $s$ non-zero vectors and the maximum is over all unit vectors $x$ orthogonal to the $s$ non-zero vectors.


Proof: $A$ is orthogonally diagonalizable. Let $P$ satisfy $P^TP = I$ and $P^TAP = D$, $D$ diagonal. Let $y = P^Tx$. Then $x = Py$ and
$$x^TAx = y^TP^TAPy = y^TDy = \sum_{i=1}^{n} \lambda_i y_i^2.$$
Since there is a one-to-one correspondence between unit vectors $x$ and $y$, maximizing $x^TAx$ subject to $\sum x_i^2 = 1$ is equivalent to maximizing $\sum_{i=1}^{n} \lambda_i y_i^2$ subject to $\sum y_i^2 = 1$. Since $\lambda_1 \ge \lambda_i$, $2 \le i \le n$, $y = (1, 0, \ldots, 0)$ maximizes $\sum_{i=1}^{n} \lambda_i y_i^2$ at $\lambda_1$. Then $x = Py$ is the first column of $P$ and is the first eigenvector of $A$. Similarly $\lambda_n$ is the minimum value of $x^TAx$ subject to the same conditions.

Now consider maximizing $x^TAx$ subject to the conditions

1. $\sum x_i^2 = 1$

2. $r_i^Tx = 0$

where the $r_i$ are any set of non-zero vectors. We ask over all possible choices of $s$ vectors what is the minimum value assumed by this maximum.
$$\min_{r_1,\ldots,r_s}\ \max_{x,\ r_i^Tx = 0} x^TAx$$

As above, we may work with $y$. The conditions are

1. $\sum y_i^2 = 1$

2. $q_i^Ty = 0$ where $q_i^T = r_i^TP$

Consider any choice for the vectors $r_1, r_2, \ldots, r_s$. This gives a corresponding set of $q_i$. The $y_i$ therefore satisfy $s$ linear homogeneous equations. If we add $y_{s+2} = y_{s+3} = \cdots = y_n = 0$ we have $n-1$ homogeneous equations in $n$ unknowns $y_1, \ldots, y_n$. There is at least one solution that can be normalized so that $\sum y_i^2 = 1$. With this choice of $y$
$$y^TDy = \sum_{i=1}^{s+1} \lambda_i y_i^2 \ge \lambda_{s+1}$$
since the coordinates $y_i$ with $i \ge s+2$ are zero and $\lambda_i \ge \lambda_{s+1}$ for $i \le s+1$. Thus, for any choice of $r_i$ there will be a $y$ such that
$$\max_{y,\ q_i^Ty = 0} (y^TP^TAPy) \ge \lambda_{s+1}$$
and hence
$$\min_{r_1,r_2,\ldots,r_s}\ \max_{y,\ q_i^Ty = 0} (y^TP^TAPy) \ge \lambda_{s+1}.$$

However, there is a set of $s$ constraints for which the maximum is less than or equal to $\lambda_{s+1}$. Fix the relations to be $y_i = 0$, $1 \le i \le s$. There are $s$ equations in $n$ unknowns and for any $y$ subject to these relations
$$y^TDy = \sum_{i=s+1}^{n} \lambda_i y_i^2 \le \lambda_{s+1}.$$
Combining the two inequalities, $\min \max y^TDy = \lambda_{s+1}$. ∎


The above theorem tells us that the maximum of $x^TAx$ subject to the constraint that $|x|^2 = 1$ is $\lambda_1$. Consider the problem of maximizing $x^TAx$ subject to the additional restriction that $x$ is orthogonal to the first eigenvector. This is equivalent to maximizing $y^TP^TAPy$ subject to $y$ being orthogonal to $(1, 0, \ldots, 0)$, i.e. the first component of $y$ being 0. This maximum is clearly $\lambda_2$ and occurs for $y = (0, 1, 0, \ldots, 0)$. The corresponding $x$ is the second column of $P$ or the second eigenvector of $A$.


Similarly the maximum of $x^TAx$ for $p_1^Tx = p_2^Tx = \cdots = p_s^Tx = 0$, where $p_i$ is the $i$th column of $P$, is $\lambda_{s+1}$ and is obtained for $x = p_{s+1}$.
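The extremal characterization can be seen on a concrete matrix. The sketch below is an illustration under stated assumptions: it uses the 2 x 2 matrix [[2, 1], [1, 2]], whose eigenvalues are 3 and 1 with first eigenvector along (1, 1), and the function names are hypothetical. It scans unit vectors on the circle rather than solving the optimization exactly.

```python
import math

A = [[2.0, 1.0], [1.0, 2.0]]  # eigenvalues 3 and 1; first eigenvector along (1, 1)

def quad_form(A, x):
    """Return x^T A x for a 2 x 2 matrix A and a 2-vector x."""
    return sum(x[i] * A[i][j] * x[j] for i in range(2) for j in range(2))

def max_on_unit_circle(A, steps=3600):
    """Largest value of x^T A x over a grid of unit vectors (cos t, sin t)."""
    best = -math.inf
    for k in range(steps):
        t = 2.0 * math.pi * k / steps
        best = max(best, quad_form(A, (math.cos(t), math.sin(t))))
    return best

top = max_on_unit_circle(A)  # approximates lambda_1
# The only unit vectors orthogonal to (1, 1) are +/- (1, -1)/sqrt(2).
x_perp = (1.0 / math.sqrt(2.0), -1.0 / math.sqrt(2.0))
second = quad_form(A, x_perp)  # equals lambda_2
print(top, second)
```

In two dimensions the constraint of being orthogonal to the first eigenvector leaves a single direction, so the constrained maximum is attained trivially at the second eigenvector.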
12.7.4 Eigenvalues of the Sum of Two Symmetric Matrices


The min max theorem is useful in proving many other results. The following theorem 
shows how adding a matrix B to a matrix A changes the eigenvalues of A. The theorem 
is useful for determining the effect of a small perturbation on the eigenvalues of A. 


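Theorem 12.13 Let $A$ and $B$ be $n \times n$ symmetric matrices. Let $C = A + B$. Let $\alpha_i$, $\beta_i$, and $\gamma_i$ denote the eigenvalues of $A$, $B$, and $C$ respectively, where $\alpha_1 \ge \alpha_2 \ge \cdots \ge \alpha_n$ and similarly for $\beta_i$, $\gamma_i$. Then $\alpha_s + \beta_1 \ge \gamma_s \ge \alpha_s + \beta_n$.


Proof: By the min max theorem we have
$$\alpha_s = \min_{r_1,\ldots,r_{s-1}}\ \max_{x \perp r_1,\ldots,r_{s-1}} (x^TAx).$$
Suppose $r_1, r_2, \ldots, r_{s-1}$ attain the minimum in the expression. Then using the min max theorem on $C$,
$$\gamma_s \le \max_{x \perp r_1,\ldots,r_{s-1}} \big(x^T(A+B)x\big) \le \max_{x \perp r_1,\ldots,r_{s-1}} (x^TAx) + \max_{x \perp r_1,\ldots,r_{s-1}} (x^TBx) \le \alpha_s + \max_{x} (x^TBx) \le \alpha_s + \beta_1.$$
Therefore, $\gamma_s \le \alpha_s + \beta_1$.


An application of the result to $A = C + (-B)$ gives $\alpha_s \le \gamma_s - \beta_n$. The eigenvalues of $-B$ are minus the eigenvalues of $B$ and thus $-\beta_n$ is the largest eigenvalue. Hence $\gamma_s \ge \alpha_s + \beta_n$, and combining inequalities yields $\alpha_s + \beta_1 \ge \gamma_s \ge \alpha_s + \beta_n$. ∎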
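The bounds of Theorem 12.13 can be spot-checked on random 2 x 2 symmetric matrices. This is a hedged sketch rather than a proof: the helper names are hypothetical, a matrix [[a, b], [b, d]] is stored as the triple (a, b, d), and the eigenvalues come from the closed form for the 2 x 2 case.

```python
import math
import random

def eigs_sym_2x2(a, b, d):
    """Eigenvalues of [[a, b], [b, d]] in non-increasing order."""
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return [mean + radius, mean - radius]

def weyl_bounds_hold(A, B, tol=1e-9):
    """Check alpha_s + beta_n <= gamma_s <= alpha_s + beta_1 for every s."""
    C = tuple(p + q for p, q in zip(A, B))
    alpha, beta, gamma = eigs_sym_2x2(*A), eigs_sym_2x2(*B), eigs_sym_2x2(*C)
    return all(
        alpha[s] + beta[-1] - tol <= gamma[s] <= alpha[s] + beta[0] + tol
        for s in range(2)
    )

rng = random.Random(0)
trials = [
    (tuple(rng.uniform(-5, 5) for _ in range(3)),
     tuple(rng.uniform(-5, 5) for _ in range(3)))
    for _ in range(1000)
]
print(all(weyl_bounds_hold(A, B) for A, B in trials))
```

When B is a small perturbation, both of its extreme eigenvalues are small, so each eigenvalue of C stays close to the corresponding eigenvalue of A.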


Lemma 12.14 Let $A$ and $B$ be $n \times n$ symmetric matrices. Let $C = A + B$. Let $\alpha_i$, $\beta_i$, and $\gamma_i$ denote the eigenvalues of $A$, $B$, and $C$ respectively, where $\alpha_1 \ge \alpha_2 \ge \cdots \ge \alpha_n$ and similarly for $\beta_i$, $\gamma_i$. Then $\gamma_{r+s-1} \le \alpha_r + \beta_s$.


Proof: There is a set of $r-1$ relations such that over all $x$ satisfying the $r-1$ relationships
$$\max(x^TAx) = \alpha_r.$$
And a set of $s-1$ relations such that over all $x$ satisfying the $s-1$ relationships
$$\max(x^TBx) = \beta_s.$$
Consider $x$ satisfying all these $r+s-2$ relations. For any such $x$
$$x^TCx = x^TAx + x^TBx \le \alpha_r + \beta_s$$
and hence over all the $x$
$$\max(x^TCx) \le \alpha_r + \beta_s.$$
Taking the minimum over all sets of $r+s-2$ relations
$$\gamma_{r+s-1} = \min \max(x^TCx) \le \alpha_r + \beta_s.$$ ∎


12.7.5 Norms


A set of vectors $\{x_1, \ldots, x_n\}$ is orthogonal if $x_i^Tx_j = 0$ for $i \ne j$ and is orthonormal if in addition $|x_i| = 1$ for all $i$. A matrix $A$ is orthonormal if $A^TA = I$. If $A$ is a square orthonormal matrix, then rows as well as columns are orthogonal. In other words, if $A$ is square orthonormal, then $A^T$ is also. In the case of matrices over the complexes, the concept of an orthonormal matrix is replaced by that of a unitary matrix. $A^*$ is the conjugate transpose of $A$ if $a^*_{ij} = \bar{a}_{ji}$ where $a^*_{ij}$ is the $ij$th entry of $A^*$ and $\bar{a}_{ij}$ is the complex conjugate of the $ij$th element of $A$. A matrix $A$ over the field of complex numbers is unitary if $AA^* = I$.


Norms 


A norm on $R^n$ is a function $f : R^n \to R$ satisfying the following three axioms:
1. $f(x) \ge 0$,
2. $f(x + y) \le f(x) + f(y)$, and
3. $f(\alpha x) = |\alpha| f(x)$.
A norm on a vector space provides a distance function where
distance$(x, y)$ = norm$(x - y)$.
An important class of norms for vectors is the $p$-norms defined for $p > 0$ by
$$|x|_p = \left(|x_1|^p + \cdots + |x_n|^p\right)^{\frac{1}{p}}.$$

Important special cases are

$|x|_0$ = the number of non-zero entries (not quite a norm: fails #3)
$|x|_1 = |x_1| + \cdots + |x_n|$
$|x|_2 = \sqrt{|x_1|^2 + \cdots + |x_n|^2}$
$|x|_\infty = \max_i |x_i|$.





Lemma 12.15 For any $1 \le p \le q$, $|x|_q \le |x|_p$.


Proof:
$$|x|_q^q = \sum_i |x_i|^q.$$
Let $a_i = |x_i|^q$ and $\rho = p/q$. Using Jensen's inequality (see Section 12.3) that for any non-negative reals $a_1, a_2, \ldots, a_n$ and any $\rho \in (0,1)$, we have $\left(\sum_{i} a_i\right)^\rho \le \sum_{i} a_i^\rho$, the lemma is proved. ∎
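Lemma 12.15 says the p-norm does not increase as p grows. The short sketch below illustrates this on one vector; the function name `p_norm` and the sample vector are assumptions chosen for the example.

```python
def p_norm(x, p):
    """The p-norm (|x_1|^p + ... + |x_n|^p)^(1/p) for p >= 1."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

x = [3.0, -4.0, 1.0]
norms = [p_norm(x, p) for p in (1, 2, 3, 10)]
print(norms)  # a non-increasing list that approaches max |x_i| = 4
```

As p grows the value approaches the infinity norm, the largest absolute entry, from above.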
There are two important matrix norms, the matrix $p$-norm
$$\|A\|_p = \max_{|x|_p = 1} |Ax|_p$$
and the Frobenius norm
$$\|A\|_F = \sqrt{\sum_{ij} a_{ij}^2}.$$

Let $a_i$ be the $i$th column of $A$. Then $\|A\|_F^2 = \sum_i a_i^Ta_i = \mathrm{tr}(A^TA)$. A similar argument on the rows yields $\|A\|_F^2 = \mathrm{tr}(AA^T)$. Thus, $\|A\|_F^2 = \mathrm{tr}(A^TA) = \mathrm{tr}(AA^T)$.
If $A$ is symmetric and rank $k$
$$\|A\|_2^2 \le \|A\|_F^2 \le k\,\|A\|_2^2.$$


12.7.6 Important Norms and Their Properties 
Lemma 12.16 $\|AB\|_2 \le \|A\|_2 \|B\|_2$


Proof: $\|AB\|_2 = \max_{|x|=1} |ABx|$. Let $y$ be the value of $x$ that achieves the maximum and let $z = By$. Then
$$\|AB\|_2 = |ABy| = |Az| = \left|A\frac{z}{|z|}\right| |z|.$$
But $\left|A\frac{z}{|z|}\right| \le \max_{|x|=1} |Ax| = \|A\|_2$ and $|z| \le \max_{|x|=1} |Bx| = \|B\|_2$. Thus $\|AB\|_2 \le \|A\|_2 \|B\|_2$. ∎

Let $Q$ be an orthonormal matrix.

Lemma 12.17 For all $x$, $|Qx| = |x|$.

Proof: $|Qx|_2^2 = x^TQ^TQx = x^Tx = |x|_2^2$. ∎

Lemma 12.18 $\|QA\|_2 = \|A\|_2$

Proof: For all $x$, $|Qx| = |x|$. Replacing $x$ by $Ax$, $|QAx| = |Ax|$ and thus $\max_{|x|=1} |QAx| = \max_{|x|=1} |Ax|$. ∎

Lemma 12.19 $\|AB\|_F^2 \le \|A\|_F^2 \|B\|_F^2$


Proof: Let $a_i^T$ be the $i$th row of $A$ and let $b_j$ be the $j$th column of $B$, so that the $ij$th entry of $AB$ is $a_i^Tb_j$. By the Cauchy-Schwarz inequality $|a_i^Tb_j| \le |a_i|\,|b_j|$. Thus
$$\|AB\|_F^2 = \sum_i \sum_j |a_i^Tb_j|^2 \le \sum_i \sum_j |a_i|^2 |b_j|^2 = \sum_i |a_i|^2 \sum_j |b_j|^2 = \|A\|_F^2 \|B\|_F^2.$$ ∎

Lemma 12.20 $\|QA\|_F = \|A\|_F$.
Proof: $\|QA\|_F^2 = \mathrm{Tr}(A^TQ^TQA) = \mathrm{Tr}(A^TA) = \|A\|_F^2$. ∎


Lemma 12.21 For real, symmetric matrix $A$ with eigenvalues $\lambda_1 \ge \lambda_2 \ge \ldots$, $\|A\|_2^2 = \max(\lambda_1^2, \lambda_n^2)$ and $\|A\|_F^2 = \lambda_1^2 + \lambda_2^2 + \cdots + \lambda_n^2$.


Proof: Suppose the spectral decomposition of $A$ is $PDP^T$, where $P$ is an orthogonal matrix and $D$ is diagonal. We saw that $\|P^TA\|_2 = \|A\|_2$. Applying this again, $\|P^TAP\|_2 = \|A\|_2$. But, $P^TAP = D$ and clearly for a diagonal matrix $D$, $\|D\|_2$ is the largest absolute value diagonal entry from which the first equation follows. The proof of the second is analogous. ∎


If $A$ is real and symmetric and of rank $k$ then $\|A\|_2^2 \le \|A\|_F^2 \le k\,\|A\|_2^2$.
Theorem 12.22 $\|A\|_2^2 \le \|A\|_F^2 \le k\,\|A\|_2^2$


Proof: It is obvious for diagonal matrices that $\|D\|_2^2 \le \|D\|_F^2 \le k\,\|D\|_2^2$. Let $D = Q^TAQ$ where $Q$ is orthonormal. The result follows immediately since for $Q$ orthonormal, $\|QA\|_2 = \|A\|_2$ and $\|QA\|_F = \|A\|_F$. ∎


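Real and symmetric are necessary for some of these theorems. This condition was needed to express $\Sigma = Q^TAQ$. Theorem 12.22 itself does not depend on it: for any matrix of rank $k$, $\|A\|_F^2$ is the sum of the $k$ non-zero squared singular values and $\|A\|_2$ is the largest singular value, so $\|A\|_2^2 \le \|A\|_F^2 \le k\,\|A\|_2^2$ still holds. For example, suppose $A$ is the $n \times n$ matrix
$$A = \begin{pmatrix} 1 & 1 & 0 & \cdots & 0 \\ 1 & 1 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & 1 & 0 & \cdots & 0 \end{pmatrix}.$$
Every row of $A$ equals $(1, 1, 0, \ldots, 0)$, so $A$ has rank 1 and $\|A\|_2 = \|A\|_F = \sqrt{2n}$.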


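Lemma 12.23 Let $A$ be a symmetric matrix. Then $\|A\|_2 = \max_{|x|=1} |x^TAx|$.


Proof: By definition, the 2-norm of $A$ is $\|A\|_2 = \max_{|x|=1} |Ax|$. Thus,
$$\|A\|_2 = \max_{|x|=1} |Ax| = \max_{|x|=1} \sqrt{x^TA^TAx} = \sqrt{\lambda_1^2} = |\lambda_1| = \max_{|x|=1} |x^TAx|$$
where $\lambda_1$ here denotes the eigenvalue of $A$ of largest absolute value. ∎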


The two norm of a matrix $A$ is greater than or equal to the 2-norm of any of its columns. Let $a_u$ be a column of $A$.


Lemma 12.24 $|a_u| \le \|A\|_2$
Proof: Let $e_u$ be the unit vector with a 1 in position $u$ and all other entries zero. Note $\lambda = \max_{|x|=1} |Ax|$. Let $x = e_u$, so that $Ae_u = a_u$ is column $u$ of $A$. Then $|a_u| = |Ae_u| \le \max_{|x|=1} |Ax| = \lambda$. ∎

12.7.7 Additional Linear Algebra

Lemma 12.25 Let $A$ be an $n \times n$ symmetric matrix. Then $\det(A) = \lambda_1 \lambda_2 \cdots \lambda_n$.


Proof: The $\det(A - \lambda I)$ is a polynomial in $\lambda$ of degree $n$. The coefficient of $\lambda^n$ will be $\pm 1$ depending on whether $n$ is odd or even. Let the roots of this polynomial be $\lambda_1, \lambda_2, \ldots, \lambda_n$. Then $\det(A - \lambda I) = (-1)^n \prod_{i=1}^{n} (\lambda - \lambda_i)$. Thus
$$\det(A) = \det(A - \lambda I)\big|_{\lambda = 0} = (-1)^n \prod_{i=1}^{n} (\lambda - \lambda_i)\Big|_{\lambda = 0} = \lambda_1 \lambda_2 \cdots \lambda_n.$$ ∎


The trace of a matrix is defined to be the sum of its diagonal elements. That is, $\mathrm{tr}(A) = a_{11} + a_{22} + \cdots + a_{nn}$.


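Lemma 12.26 $\mathrm{tr}(A) = \lambda_1 + \lambda_2 + \cdots + \lambda_n$.


Proof: Consider the coefficient of $\lambda^{n-1}$ in $\det(A - \lambda I) = (-1)^n \prod_{i=1}^{n} (\lambda - \lambda_i)$. Write
$$A - \lambda I = \begin{pmatrix} a_{11} - \lambda & a_{12} & \cdots \\ a_{21} & a_{22} - \lambda & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}.$$
Calculate $\det(A - \lambda I)$ by expanding along the first row. Each term in the expansion involves a determinant of size $n-1$ which is a polynomial in $\lambda$ of degree $n-2$ except for the principal minor which is of degree $n-1$. Thus the term of degree $n-1$ comes from
$$(a_{11} - \lambda)(a_{22} - \lambda) \cdots (a_{nn} - \lambda)$$
and has coefficient $(-1)^{n-1}(a_{11} + a_{22} + \cdots + a_{nn})$. Now
$$(-1)^n \prod_{i=1}^{n} (\lambda - \lambda_i) = (-1)^n (\lambda - \lambda_1)(\lambda - \lambda_2) \cdots (\lambda - \lambda_n) = (-1)^n \left(\lambda^n - (\lambda_1 + \lambda_2 + \cdots + \lambda_n)\lambda^{n-1} + \cdots\right).$$
Therefore equating coefficients $\lambda_1 + \lambda_2 + \cdots + \lambda_n = a_{11} + a_{22} + \cdots + a_{nn} = \mathrm{tr}(A)$.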


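Note that $(\mathrm{tr}(A))^2 \ne \mathrm{tr}(A^2)$. For example $A = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}$ has trace 3, $A^2 = \begin{pmatrix} 1 & 0 \\ 0 & 4 \end{pmatrix}$ has trace $5 \ne 9$. However $\mathrm{tr}(A^2) = \lambda_1^2 + \lambda_2^2 + \cdots + \lambda_n^2$. To see this, observe that $A^2 = (V^TDV)^2 = V^TD^2V$. Thus, the eigenvalues of $A^2$ are the squares of the eigenvalues for $A$. ∎

Alternative proof that $\mathrm{tr}(A) = \lambda_1 + \lambda_2 + \cdots + \lambda_n$: Suppose the spectral decomposition of $A$ is $A = PDP^T$. We have
$$\mathrm{tr}(A) = \mathrm{tr}(PDP^T) = \mathrm{tr}(DP^TP) = \mathrm{tr}(D) = \lambda_1 + \lambda_2 + \cdots + \lambda_n.$$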
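Lemma 12.27 If $A$ is $n \times m$ and $B$ is an $m \times n$ matrix, then $\mathrm{tr}(AB) = \mathrm{tr}(BA)$.
Proof:
$$\mathrm{tr}(AB) = \sum_{i=1}^{n} \sum_{j=1}^{m} a_{ij} b_{ji} = \sum_{j=1}^{m} \sum_{i=1}^{n} b_{ji} a_{ij} = \mathrm{tr}(BA)$$ ∎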
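A small numerical instance of Lemma 12.27 follows. The sketch uses plain nested lists; the helpers `matmul` and `trace` and the two sample matrices are assumptions made for the illustration.

```python
def matmul(A, B):
    """Product of an n x m matrix A and an m x k matrix B (nested lists)."""
    return [
        [sum(A[i][t] * B[t][j] for t in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]

def trace(M):
    """Sum of the diagonal entries of a square matrix."""
    return sum(M[i][i] for i in range(len(M)))

A = [[1, 2, 3], [4, 5, 6]]       # 2 x 3
B = [[7, 8], [9, 10], [11, 12]]  # 3 x 2
print(trace(matmul(A, B)), trace(matmul(B, A)))  # AB is 2 x 2, BA is 3 x 3
```

The two products have different sizes, yet their traces agree because both equal the same double sum over the entries.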


Pseudo inverse 


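Let $A$ be an $n \times m$ rank $r$ matrix and let $A = U\Sigma V^T$ be the singular value decomposition of $A$. Let $\Sigma' = \mathrm{diag}\left(\frac{1}{\sigma_1}, \ldots, \frac{1}{\sigma_r}, 0, \ldots, 0\right)$ where $\sigma_1, \ldots, \sigma_r$ are the non-zero singular values of $A$. Then $A' = V\Sigma'U^T$ is the pseudo inverse of $A$. Among all $X$ that minimize $\|AX - I\|_F$, it is the one with the smallest Frobenius norm.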


Second eigenvector 


Suppose the eigenvalues of a matrix are $\lambda_1 \ge \lambda_2 \ge \cdots$. The second eigenvalue, $\lambda_2$, plays an important role for matrices representing graphs. It may be the case that $|\lambda_n| > |\lambda_2|$.


Why is the second eigenvalue so important? Consider partitioning the vertices of a regular degree $d$ graph $G = (V, E)$ into two blocks of equal size so as to minimize the number of edges between the two blocks. Assign value $+1$ to the vertices in one block and $-1$ to the vertices in the other block. Let $x$ be the vector whose components are the $\pm 1$ values assigned to the vertices. If two vertices, $i$ and $j$, are in the same block, then $x_i$ and $x_j$ are both $+1$ or both $-1$ and $(x_i - x_j)^2 = 0$. If vertices $i$ and $j$ are in different blocks then $(x_i - x_j)^2 = 4$. Thus, partitioning the vertices into two blocks so as to minimize the edges between vertices in different blocks is equivalent to finding a vector $x$ with coordinates $\pm 1$ of which half of its coordinates are $+1$ and half of which are $-1$ that minimizes
$$E_{cut} = \frac{1}{4} \sum_{(i,j) \in E} (x_i - x_j)^2.$$
Let $A$ be the adjacency matrix of $G$. Then
$$x^TAx = \sum_{ij} a_{ij} x_i x_j = 2 \sum_{\text{edges}} x_i x_j$$
$$= 2 \times (\text{number of edges within components}) - 2 \times (\text{number of edges between components})$$
$$= 2 \times (\text{total number of edges}) - 4 \times (\text{number of edges between components}).$$
Maximizing $x^TAx$ over all $x$ whose coordinates are $\pm 1$ and half of whose coordinates are $+1$ is equivalent to minimizing the number of edges between components.
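The identity relating $x^TAx$ to the number of cut edges can be verified directly on a small regular graph. The sketch below assumes a 6-cycle as the example graph and uses a hypothetical helper `quad_and_cut_formula`; it checks the identity for two different balanced partitions.

```python
def quad_and_cut_formula(n, edges, x):
    """Return (x^T A x, 2|E| - 4 * cut) for a +/-1 labelling x of n vertices."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = A[j][i] = 1  # symmetric adjacency matrix
    quad = sum(A[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
    cut = sum(1 for i, j in edges if x[i] != x[j])  # edges between the blocks
    return quad, 2 * len(edges) - 4 * cut

n = 6
cycle = [(i, (i + 1) % n) for i in range(n)]  # 2-regular graph on 6 vertices
contiguous = [1, 1, 1, -1, -1, -1]            # cuts 2 edges
alternating = [1, -1, 1, -1, 1, -1]           # cuts all 6 edges
print(quad_and_cut_formula(n, cycle, contiguous))
print(quad_and_cut_formula(n, cycle, alternating))
```

The contiguous split gives the larger value of the quadratic form, in line with it cutting fewer edges.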


Since finding such an $x$ is computationally difficult, one thing we can try to do is replace the integer condition on the components of $x$ and the condition that half of the components are positive and half of the components are negative with the conditions $\sum_{i=1}^{n} x_i^2 = 1$ and $\sum_{i=1}^{n} x_i = 0$. Then finding the optimal $x$ gives us the second eigenvalue since it is easy to see that the first eigenvector is along $\mathbf{1}$.


Actually we should use $\sum_{i=1}^{n} x_i^2 = n$ not $\sum_{i=1}^{n} x_i^2 = 1$. Thus $n\lambda_2$ must be greater than or equal to
$$2 \times (\text{total number of edges}) - 4 \times (\text{number of edges between components})$$
since the maximum is taken over a larger set of $x$. The fact that $\lambda_2$ gives us a bound on the minimum number of cross edges is what makes it so important.


12.7.8 Distance Between Subspaces 


Suppose $S_1$ and $S_2$ are two subspaces. Choose a basis of $S_1$ and arrange the basis vectors as the columns of a matrix $X_1$; similarly choose a basis of $S_2$ and arrange the basis vectors as the columns of a matrix $X_2$. Note that $S_1$ and $S_2$ can have different dimensions. Define the square of the distance between two subspaces by
$$\mathrm{dist}^2(S_1, S_2) = \mathrm{dist}^2(X_1, X_2) = \|X_1 - X_2X_2^TX_1\|_F^2.$$
Here the columns of $X_2$ are taken to be orthonormal, so that $X_2X_2^T$ is the orthogonal projection onto $S_2$. Since $X_1 - X_2X_2^TX_1$ and $X_2X_2^TX_1$ are orthogonal
$$\|X_1\|_F^2 = \|X_1 - X_2X_2^TX_1\|_F^2 + \|X_2X_2^TX_1\|_F^2$$
and hence
$$\mathrm{dist}^2(X_1, X_2) = \|X_1\|_F^2 - \|X_2X_2^TX_1\|_F^2.$$
Intuitively, the distance between $X_1$ and $X_2$ is the Frobenius norm of the component of $X_1$ not in the space spanned by the columns of $X_2$.


If $X_1$ and $X_2$ are 1-dimensional unit length vectors, $\mathrm{dist}^2(X_1, X_2)$ is the sine squared of the angle between the spaces.


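Example: Consider two subspaces in four dimensions
$$X_1 = \begin{pmatrix} \frac{1}{\sqrt{2}} & 0 \\ 0 & \frac{1}{\sqrt{3}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{3}} \\ 0 & \frac{1}{\sqrt{3}} \end{pmatrix} \qquad X_2 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \\ 0 & 0 \end{pmatrix}$$
Here
$$\mathrm{dist}^2(X_1, X_2) = \left\| X_1 - X_2X_2^TX_1 \right\|_F^2 = \left\| \begin{pmatrix} 0 & 0 \\ 0 & 0 \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{3}} \\ 0 & \frac{1}{\sqrt{3}} \end{pmatrix} \right\|_F^2 = \frac{7}{6}.$$
In essence, we projected each column vector of $X_1$ onto $X_2$ and computed the Frobenius norm of $X_1$ minus the projection. The squared length of each column of the difference is the sine squared of the angle between the original column of $X_1$ and the space spanned by the columns of $X_2$. ∎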
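The distance formula is easy to evaluate mechanically. The sketch below is an illustration under assumptions: the function name `dist_sq` is hypothetical, $X_2$ is assumed to have orthonormal columns, and the inputs are taken to be $X_1$ with columns $(1/\sqrt{2}, 0, 1/\sqrt{2}, 0)$ and $(0, 1/\sqrt{3}, 1/\sqrt{3}, 1/\sqrt{3})$ and $X_2$ with the first two standard basis vectors as columns.

```python
import math

def dist_sq(X1, X2):
    """||X1 - X2 X2^T X1||_F^2, assuming the columns of X2 are orthonormal."""
    n, k1, k2 = len(X1), len(X1[0]), len(X2[0])
    # coef = X2^T X1, the coordinates of each column of X1 in the basis X2
    coef = [
        [sum(X2[r][a] * X1[r][b] for r in range(n)) for b in range(k1)]
        for a in range(k2)
    ]
    total = 0.0
    for r in range(n):
        for b in range(k1):
            projected = sum(X2[r][a] * coef[a][b] for a in range(k2))
            total += (X1[r][b] - projected) ** 2
    return total

s2, s3 = 1.0 / math.sqrt(2.0), 1.0 / math.sqrt(3.0)
X1 = [[s2, 0.0], [0.0, s3], [s2, s3], [0.0, s3]]
X2 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
print(dist_sq(X1, X2))  # 1/2 + 1/3 + 1/3
```

The projection removes the first two coordinates of each column, so only the entries in the last two rows contribute.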
12.7.9 Positive Semidefinite Matrix 


A square symmetric matrix is positive semidefinite if for all $x$, $x^TAx \ge 0$. There are actually three equivalent definitions of positive semidefinite.


1. for all $x$, $x^TAx \ge 0$
2. all eigenvalues are non-negative
3. $A = B^TB$
We will prove (1) implies (2), (2) implies (3), and (3) implies (1).
1. (1) implies (2) If $\lambda_i$ were negative, select $x = v_i$. Then $v_i^T(Av_i) = v_i^T(\lambda_i v_i) = \lambda_i < 0$.
2. (2) implies (3) $A = VDV^T = VD^{\frac{1}{2}}D^{\frac{1}{2}}V^T = B^TB$ with $B = D^{\frac{1}{2}}V^T$.
3. (3) implies (1) $x^TAx = (Bx)^TBx \ge 0$


12.8 Generating Functions 


A sequence $a_0, a_1, \ldots$, can be represented by a generating function $g(x) = \sum_{i=0}^{\infty} a_i x^i$. The advantage of the generating function is that it captures the entire sequence in a closed form that can be manipulated as an entity. For example, if $g(x)$ is the generating function for the sequence $a_0, a_1, \ldots$, then $x\frac{d}{dx}g(x)$ is the generating function for the sequence $0, a_1, 2a_2, 3a_3, \ldots$ and $x^2g''(x) + xg'(x)$ is the generating function for the sequence $0, a_1, 4a_2, 9a_3, \ldots$

Example: The generating function for the sequence $1, 1, \ldots$ is $\sum_{i=0}^{\infty} x^i = \frac{1}{1-x}$. The generating function for the sequence $0, 1, 2, 3, \ldots$ is
$$\sum_{i=0}^{\infty} i x^i = \sum_{i=0}^{\infty} x\frac{d}{dx}x^i = x\frac{d}{dx}\sum_{i=0}^{\infty} x^i = x\frac{d}{dx}\frac{1}{1-x} = \frac{x}{(1-x)^2}.$$ ∎


Example: If A can be selected 0 or 1 times and B can be selected 0, 1, or 2 times and C can be selected 0, 1, 2, or 3 times, in how many ways can five objects be selected? Consider the generating function whose $i$th coefficient is the number of ways to select $i$ objects. The generating function for the number of ways of selecting objects, selecting only A's is $1 + x$, only B's is $1 + x + x^2$, and only C's is $1 + x + x^2 + x^3$. The generating function when selecting A's, B's, and C's is the product.
$$(1 + x)(1 + x + x^2)(1 + x + x^2 + x^3) = 1 + 3x + 5x^2 + 6x^3 + 5x^4 + 3x^5 + x^6$$
The coefficient of $x^5$ is 3 and hence we can select five objects in three ways: ABBCC, ABCCC, or BBCCC. ∎
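Multiplying generating functions is just polynomial multiplication on coefficient lists, which makes the selection count easy to reproduce. In the sketch below the helper name `poly_mul` is an assumption; a polynomial is stored as a list whose entry at index i is the coefficient of $x^i$.

```python
def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (index = power of x)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

only_a = [1, 1]        # 1 + x
only_b = [1, 1, 1]     # 1 + x + x^2
only_c = [1, 1, 1, 1]  # 1 + x + x^2 + x^3
coeffs = poly_mul(poly_mul(only_a, only_b), only_c)
print(coeffs)     # coefficients of x^0 through x^6
print(coeffs[5])  # number of ways to select five objects
```

The same routine applied to two lists of probabilities gives the distribution of the sum of two independent integer valued random variables.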


The generating functions for the sum of random variables

Let $f(x) = \sum_{i=0}^{\infty} p_i x^i$ be the generating function for an integer valued random variable where $p_i$ is the probability that the random variable takes on value $i$. Let $g(x) = \sum_{i=0}^{\infty} q_i x^i$ be the generating function of an independent integer valued random variable where $q_i$ is the probability that the random variable takes on the value $i$. The sum of these two random variables has the generating function $f(x)g(x)$. This is because the coefficient of $x^i$ in the product $f(x)g(x)$ is $\sum_{k=0}^{i} p_k q_{i-k}$ and this is also the probability that the sum of the random variables is $i$. Repeating this, the generating function of a sum of independent non-negative integer valued random variables is the product of their generating functions.


12.8.1 Generating Functions for Sequences Defined by Recurrence Relation- 
ships 


Consider the Fibonacci sequence 
0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, ... 


defined by the recurrence relationship 


$$f_0 = 0 \qquad f_1 = 1 \qquad f_i = f_{i-1} + f_{i-2} \quad i \ge 2$$

Multiply each side of the recurrence by $x^i$ and sum $i$ from two to infinity.
$$\sum_{i=2}^{\infty} f_i x^i = \sum_{i=2}^{\infty} f_{i-1} x^i + \sum_{i=2}^{\infty} f_{i-2} x^i$$
$$f_2x^2 + f_3x^3 + \cdots = f_1x^2 + f_2x^3 + \cdots + f_0x^2 + f_1x^3 + \cdots$$
$$= x\left(f_1x + f_2x^2 + \cdots\right) + x^2\left(f_0 + f_1x + \cdots\right) \qquad (12.1)$$

Let
$$f(x) = \sum_{i=0}^{\infty} f_i x^i. \qquad (12.2)$$
Substituting (12.2) into (12.1) yields
$$f(x) - f_0 - f_1x = x\left(f(x) - f_0\right) + x^2f(x)$$
$$f(x) - x = xf(x) + x^2f(x)$$
$$f(x)(1 - x - x^2) = x$$

Thus, $f(x) = \frac{x}{1 - x - x^2}$ is the generating function for the Fibonacci sequence.


Note that generating functions are formal manipulations and do not necessarily converge outside some region of convergence. Consider the generating function $f(x) = \sum_{i=0}^{\infty} f_i x^i = \frac{x}{1 - x - x^2}$ for the Fibonacci sequence. Using $\sum_{i=0}^{\infty} f_i x^i$,
$$f(1) = f_0 + f_1 + f_2 + \cdots = \infty$$
and using $f(x) = \frac{x}{1 - x - x^2}$
$$f(1) = \frac{1}{1 - 1 - 1} = -1.$$

Asymptotic behavior 


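To determine the asymptotic behavior of the Fibonacci sequence write
$$f(x) = \frac{x}{1 - x - x^2} = \frac{\frac{\sqrt{5}}{5}}{1 - \phi_1 x} - \frac{\frac{\sqrt{5}}{5}}{1 - \phi_2 x}$$
where $\phi_1 = \frac{1 + \sqrt{5}}{2}$ and $\phi_2 = \frac{1 - \sqrt{5}}{2}$ are the reciprocals of the two roots of the quadratic $1 - x - x^2 = 0$.


Then
$$f(x) = \frac{\sqrt{5}}{5}\left(1 + \phi_1 x + (\phi_1 x)^2 + \cdots - \left(1 + \phi_2 x + (\phi_2 x)^2 + \cdots\right)\right).$$

Thus,
$$f_n = \frac{\sqrt{5}}{5}\left(\phi_1^n - \phi_2^n\right).$$
Since $|\phi_2| < 1$ and $\phi_1 > 1$, for large $n$, $f_n \cong \frac{\sqrt{5}}{5}\phi_1^n$. In fact, since $f_n = \frac{\sqrt{5}}{5}\left(\phi_1^n - \phi_2^n\right)$ is an integer and $\left|\frac{\sqrt{5}}{5}\phi_2^n\right| < \frac{1}{2}$, it must be the case that $f_n$ is the integer nearest to $\frac{\sqrt{5}}{5}\phi_1^n$. Hence $f_n = \left\lfloor \frac{\sqrt{5}}{5}\phi_1^n + \frac{1}{2} \right\rfloor$ for all $n$.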
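The closed form can be compared against the recurrence directly. The following sketch uses hypothetical helper names and ordinary floating point arithmetic, so the comparison is only trusted for moderate n (it is run here for n below 50), where rounding error stays far below 1/2.

```python
import math

def fib_iter(n):
    """f_n from the recurrence f_0 = 0, f_1 = 1, f_i = f_{i-1} + f_{i-2}."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def fib_closed(n):
    """Nearest integer to phi_1^n / sqrt(5), with phi_1 = (1 + sqrt(5)) / 2."""
    phi1 = (1.0 + math.sqrt(5.0)) / 2.0
    return round(phi1 ** n / math.sqrt(5.0))

print([fib_iter(n) for n in range(12)])
print(all(fib_iter(n) == fib_closed(n) for n in range(50)))
```

For much larger n an exact integer method such as the recurrence is the safer choice, since the floating point power eventually loses the needed precision.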
Means and standard deviations of sequences 


Generating functions are useful for calculating the mean and standard deviation of a sequence. Let $z$ be an integral valued random variable where $p_i$ is the probability that $z$ equals $i$. The expected value of $z$ is given by $m = \sum_{i=0}^{\infty} i p_i$. Let $p(x) = \sum_{i=0}^{\infty} p_i x^i$ be the generating function for the sequence $p_0, p_1, p_2, \ldots$. The generating function for the sequence $p_1, 2p_2, 3p_3, \ldots$ is
$$x\frac{d}{dx}p(x) = \sum_{i=0}^{\infty} i p_i x^i.$$
Thus, the expected value of the random variable $z$ is $m = xp'(x)\big|_{x=1} = p'(1)$. If $p$ was not a probability function, its average value would be $\frac{p'(1)}{p(1)}$ since we would need to normalize the area under $p$ to one.


The variance of $z$ is $E(z^2) - E^2(z)$ and can be obtained as follows.
$$x^2\frac{d^2}{dx^2}p(x)\Big|_{x=1} = \sum_{i=0}^{\infty} i(i-1)p_i x^i\Big|_{x=1} = \sum_{i=0}^{\infty} i^2 p_i - \sum_{i=0}^{\infty} i p_i = E(z^2) - E(z).$$

Thus, $\sigma^2 = E(z^2) - E^2(z) = E(z^2) - E(z) + E(z) - E^2(z) = p''(1) + p'(1) - \left(p'(1)\right)^2$.
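The formulas $m = p'(1)$ and $\sigma^2 = p''(1) + p'(1) - (p'(1))^2$ can be evaluated exactly for a finite distribution. The sketch below assumes a fair six-sided die as the example and uses a hypothetical helper `mean_and_variance`; exact rational arithmetic avoids rounding.

```python
from fractions import Fraction

def mean_and_variance(p):
    """Mean and variance from p'(1) and p''(1), where p[i] = Prob(z = i)."""
    d1 = sum(i * pi for i, pi in enumerate(p))            # p'(1)  = E(z)
    d2 = sum(i * (i - 1) * pi for i, pi in enumerate(p))  # p''(1) = E(z^2) - E(z)
    return d1, d2 + d1 - d1 * d1

# Fair die: probability 0 for the value 0 and 1/6 for each of 1..6.
die = [Fraction(0)] + [Fraction(1, 6)] * 6
mean, variance = mean_and_variance(die)
print(mean, variance)
```

The derivatives at 1 reduce to weighted sums of the probabilities, so no symbolic differentiation is needed.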
12.8.2 The Exponential Generating Function and the Moment Generating 
Function 


Besides the ordinary generating function there are a number of other types of generating functions. One of these is the exponential generating function. Given a sequence $a_0, a_1, \ldots$, the associated exponential generating function is $g(x) = \sum_{i=0}^{\infty} a_i \frac{x^i}{i!}$.

Moment generating functions

The $k$th moment of a random variable $x$ around the point $b$ is given by $E((x - b)^k)$.


Usually the word moment is used to denote the moment around the value 0 or around the mean. In the following, we use moment to mean the moment about the origin.

The moment generating function of a random variable $x$ is defined by
$$\Psi(t) = E(e^{tx}) = \int_{-\infty}^{\infty} e^{tx}p(x)\,dx.$$
Replacing $e^{tx}$ by its power series expansion $1 + tx + \frac{(tx)^2}{2!} + \cdots$ gives
$$\Psi(t) = \int_{-\infty}^{\infty} \left(1 + tx + \frac{(tx)^2}{2!} + \cdots\right)p(x)\,dx.$$


Thus, the $k$th moment of $x$ about the origin is $k!$ times the coefficient of $t^k$ in the power series expansion of the moment generating function. Hence, the moment generating function is the exponential generating function for the sequence of moments about the origin.


The moment generating function transforms the probability distribution $p(x)$ into a function $\Psi(t)$ of $t$. Note $\Psi(0) = 1$ and is the area or integral of $p(x)$. The moment generating function is closely related to the characteristic function which is obtained by replacing $e^{tx}$ by $e^{itx}$ in the above integral where $i = \sqrt{-1}$ and is related to the Fourier transform which is obtained by replacing $e^{tx}$ by $e^{-itx}$.


$\Psi(t)$ is closely related to the Fourier transform and its properties are essentially the same. In particular, $p(x)$ can be uniquely recovered by an inverse transform from $\Psi(t)$. More specifically, if all the moments $m_i$ are finite and the sum $\sum_{i=0}^{\infty} \frac{m_i}{i!}t^i$ converges absolutely in a region around the origin, then $p(x)$ is uniquely determined.


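The Gaussian probability distribution with zero mean and unit variance is given by $p(x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}$. Its moments are given by
$$u_n = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} x^n e^{-\frac{x^2}{2}}\,dx = \begin{cases} \frac{n!}{2^{n/2}\left(\frac{n}{2}\right)!} & n \text{ even} \\ 0 & n \text{ odd} \end{cases}$$

To derive the above, use integration by parts to get $u_n = (n-1)u_{n-2}$ and combine this with $u_0 = 1$ and $u_1 = 0$. The steps are as follows. Let $u = e^{-\frac{x^2}{2}}$ and $v = x^{n-1}$. Then $u' = -xe^{-\frac{x^2}{2}}$ and $v' = (n-1)x^{n-2}$. Now $uv = \int u'v + \int uv'$ or
$$e^{-\frac{x^2}{2}}x^{n-1} = -\int x^n e^{-\frac{x^2}{2}}\,dx + \int (n-1)x^{n-2}e^{-\frac{x^2}{2}}\,dx.$$
From which
$$\int x^n e^{-\frac{x^2}{2}}\,dx = (n-1)\int x^{n-2}e^{-\frac{x^2}{2}}\,dx - e^{-\frac{x^2}{2}}x^{n-1}$$
$$\int_{-\infty}^{\infty} x^n e^{-\frac{x^2}{2}}\,dx = (n-1)\int_{-\infty}^{\infty} x^{n-2}e^{-\frac{x^2}{2}}\,dx.$$

Thus, $u_n = (n-1)u_{n-2}$.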


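The moment generating function is given by
$$g(s) = \sum_{n=0}^{\infty} \frac{u_n s^n}{n!} = \sum_{\substack{n=0 \\ n \text{ even}}}^{\infty} \frac{n!}{2^{n/2}\left(\frac{n}{2}\right)!}\frac{s^n}{n!} = \sum_{i=0}^{\infty} \frac{s^{2i}}{2^i\, i!} = \sum_{i=0}^{\infty} \frac{1}{i!}\left(\frac{s^2}{2}\right)^i = e^{\frac{s^2}{2}}.$$

For the general Gaussian, the moment generating function is
$$g(s) = e^{su + \frac{\sigma^2}{2}s^2}.$$
Thus, given two independent Gaussians with mean $u_1$ and $u_2$ and variances $\sigma_1^2$ and $\sigma_2^2$, the product of their moment generating functions is
$$e^{s(u_1 + u_2) + \frac{\sigma_1^2 + \sigma_2^2}{2}s^2},$$
the moment generating function for a Gaussian with mean $u_1 + u_2$ and variance $\sigma_1^2 + \sigma_2^2$. Thus, the convolution of two Gaussians is a Gaussian and the sum of two independent random variables that are both Gaussian is a Gaussian random variable.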


12.9 Miscellaneous 
12.9.1 Lagrange Multipliers 


Lagrange multipliers are used to convert a constrained optimization problem into an un- 
constrained optimization. Suppose we wished to maximize a function f(x) subject to a 
constraint g(x) = c. The value of f(x) along the constraint g(x) = c might increase for 
a while and then start to decrease. At the point where f(x) stops increasing and starts 
to decrease, the contour line for f(x) is tangent to the curve of the constraint g(x) = c. 
Stated another way the gradient of f(x) and the gradient of g(x) are parallel. 


By introducing a new variable $\lambda$ we can express the condition by $\nabla_x f = \lambda \nabla_x g$ and $g = c$. Up to the sign of $\lambda$, which is immaterial, these two conditions hold if and only if
$$\nabla_{x\lambda}\left(f(x) + \lambda\left(g(x) - c\right)\right) = 0.$$

The partial with respect to $\lambda$ establishes that $g(x) = c$. We have converted the constrained optimization problem in $x$ to an unconstrained problem with variables $x$ and $\lambda$.
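As a small worked instance of the method, consider maximizing $f(x, y) = xy$ subject to $g(x, y) = x + y = 2$. The function and the constraint are chosen here purely for illustration and do not come from the text.

```latex
\begin{align*}
L(x, y, \lambda) &= xy + \lambda\,(x + y - 2) \\
\frac{\partial L}{\partial x} &= y + \lambda = 0, \qquad
\frac{\partial L}{\partial y} = x + \lambda = 0, \qquad
\frac{\partial L}{\partial \lambda} = x + y - 2 = 0 \\
x &= y = -\lambda \;\Longrightarrow\; -2\lambda = 2
  \;\Longrightarrow\; \lambda = -1,\quad x = y = 1,\quad f(1, 1) = 1
\end{align*}
```

At the solution $\nabla f = (1, 1)$ and $\nabla g = (1, 1)$ are parallel, which is the tangency condition described above.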
Figure 12.3: In finding the minimum of $f(x,y)$ within the ellipse, the path heads towards the minimum of $f(x,y)$ until it hits the boundary of the ellipse and then follows the boundary of the ellipse until the tangent of the boundary is in the same direction as the contour line of $f(x,y)$.


12.9.2 Finite Fields 


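For a prime $p$ and positive integer $n$ there is a unique finite field with $p^n$ elements. In Section 8.6 we used the field $GF(2^n)$, which consists of polynomials of degree less than $n$ with coefficients over the field $GF(2)$. In $GF(2^8)$
$$(x^7 + x^5 + x)(x^6 + x^5 + x^4) = x^{13} + x^{12} + x^{11} + x^{11} + x^{10} + x^9 + x^7 + x^6 + x^5$$
$$= x^{13} + x^{12} + x^{10} + x^9 + x^7 + x^6 + x^5$$
$$= x^6 + x^4 + x^3 + x^2 \mod x^8 + x^4 + x^3 + x + 1$$
Division of $x^{13} + x^{12} + x^{10} + x^9 + x^7 + x^6 + x^5$ by $x^8 + x^4 + x^3 + x + 1$ is illustrated below. The quotient is $x^5 + x^4 + x^2$.

    x^13 + x^12 + x^10 + x^9 + x^7 + x^6 + x^5
    -x^5 (x^8 + x^4 + x^3 + x + 1) = x^13 + x^9 + x^8 + x^6 + x^5
        leaves  x^12 + x^10 + x^8 + x^7
    -x^4 (x^8 + x^4 + x^3 + x + 1) = x^12 + x^8 + x^7 + x^5 + x^4
        leaves  x^10 + x^5 + x^4
    -x^2 (x^8 + x^4 + x^3 + x + 1) = x^10 + x^6 + x^5 + x^3 + x^2
        leaves  x^6 + x^4 + x^3 + x^2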
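The multiplication and reduction above map naturally onto bit operations, with bit i of an integer standing for the coefficient of $x^i$. The following sketch is one possible encoding, not the only one; the names `MODULUS` and `gf256_mul` are assumptions made for the illustration.

```python
MODULUS = 0b100011011  # x^8 + x^4 + x^3 + x + 1

def gf256_mul(a, b, modulus=MODULUS):
    """Multiply two elements of GF(2^8) represented as 8-bit integers."""
    product = 0
    while b:
        if b & 1:
            product ^= a  # addition of polynomials over GF(2) is XOR
        a <<= 1
        b >>= 1
    # Long division: cancel every term of degree 8 or more, highest first.
    for shift in range(product.bit_length() - 9, -1, -1):
        if product & (1 << (shift + 8)):
            product ^= modulus << shift
    return product

a = 0b10100010  # x^7 + x^5 + x
b = 0b01110000  # x^6 + x^5 + x^4
print(bin(gf256_mul(a, b)))  # bits 6, 4, 3, 2 set: x^6 + x^4 + x^3 + x^2
```

Each XOR with a shifted copy of the modulus corresponds to one subtraction step of the long division shown above.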





12.9.3 Application of Mean Value Theorem 


The mean value theorem states that if $f(x)$ is continuous and differentiable on the interval $[a, b]$, then there exists $c$, $a \le c \le b$ such that $f'(c) = \frac{f(b) - f(a)}{b - a}$. That is, at some point between $a$ and $b$ the derivative of $f$ equals the slope of the line from $f(a)$ to $f(b)$. See Figure 12.4.


Figure 12.4: Illustration of the mean value theorem.


One application of the mean value theorem is with the Taylor expansion of a function. The Taylor expansion about the origin of $f(x)$ is
$$f(x) = f(0) + f'(0)x + \frac{1}{2!}f''(0)x^2 + \frac{1}{3!}f'''(0)x^3 + \cdots \qquad (12.3)$$

By the mean value theorem there exists $c$, $0 \le c \le x$, such that $f'(c) = \frac{f(x) - f(0)}{x}$ or $f(x) - f(0) = xf'(c)$. Thus
$$xf'(c) = f'(0)x + \frac{1}{2!}f''(0)x^2 + \frac{1}{3!}f'''(0)x^3 + \cdots$$
and
$$f(x) = f(0) + xf'(c).$$

One could apply the mean value theorem to $f'(x)$ in
$$f'(x) = f'(0) + f''(0)x + \frac{1}{2!}f'''(0)x^2 + \cdots$$
Then there exists $d$, $0 \le d \le x$ such that
$$xf''(d) = f''(0)x + \frac{1}{2!}f'''(0)x^2 + \cdots$$
Integrating
$$\frac{1}{2}x^2f''(d) = \frac{1}{2!}f''(0)x^2 + \frac{1}{3!}f'''(0)x^3 + \cdots$$
Substituting into Eq(12.3)
$$f(x) = f(0) + f'(0)x + \frac{1}{2}x^2f''(d).$$


12.10 Exercises 

Exercise 12.1 What is the difference between saying $f(n)$ is $O(n^3)$ and $f(n)$ is $o(n^3)$?
Exercise 12.2 If $f(n) \sim g(n)$ what can we say about $f(n) + g(n)$ and $f(n) - g(n)$?
Exercise 12.3 What is the difference between $\sim$ and $\Theta$?

Exercise 12.4 If $f(n)$ is $O(g(n))$ does this imply that $g(n)$ is $\Omega(f(n))$?

Exercise 12.5 What is Jim (o ES 


Exercise 12.6 Select $a$, $b$, and $c$ uniformly at random from $[0,1]$. The probability that $b < a$ is 1/2. The probability that $c < a$ is 1/2. However, the probability that both $b$ and $c$ are less than $a$ is 1/3 not 1/4. Why is this? Note that the six possible permutations abc, acb, bac, cab, bca, and cba, are all equally likely. Assume that $a$, $b$, and $c$ are drawn from the interval $(0,1]$. Given that $b < a$, what is the probability that $c < a$?

Exercise 12.7 Let $A_1, A_2, \ldots, A_n$ be events. Prove that $\mathrm{Prob}(A_1 \cup A_2 \cup \cdots \cup A_n) \le \sum_{i=1}^{n} \mathrm{Prob}(A_i)$.

Exercise 12.8 Give an example of three random variables that are pairwise independent but not fully independent.


Exercise 12.9 Give examples of non-negative valued random variables with median >> 
mean. Can we have median << mean? 


Exercise 12.10 Consider $n$ samples $x_1, x_2, \ldots, x_n$ from a Gaussian distribution of mean $\mu$ and variance $\sigma^2$. For this distribution $m = \frac{x_1 + x_2 + \cdots + x_n}{n}$ is an unbiased estimator of $\mu$. If $\mu$ is known then $\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2$ is an unbiased estimator of $\sigma^2$. Prove that if we approximate $\mu$ by $m$, then $\frac{1}{n-1}\sum_{i=1}^{n} (x_i - m)^2$ is an unbiased estimator of $\sigma^2$.


AN2 
Exercise 12.11 Given the distribution zel) what is the probability that x >1? 


Exercise 12.12 $e^{-\frac{x^2}{2}}$ has value 1 at $x = 0$ and drops off very fast as $x$ increases. Suppose we wished to approximate $e^{-\frac{x^2}{2}}$ by a function $f(x)$ where
$$f(x) = \begin{cases} 1 & |x| \le a \\ 0 & |x| > a \end{cases}$$
What value of $a$ should we use? What is the integral of the error between $f(x)$ and $e^{-\frac{x^2}{2}}$?

Exercise 12.13 Given two sets of red and black balls with the number of red and black balls in each set shown in the table below.

          red | black
    Set 1  40 |  60
    Set 2  50 |  50

Randomly draw a ball from one of the sets. Suppose that it turns out to be red. What is the probability that it was drawn from Set 1?


Exercise 12.14 Why cannot one prove an analogous type of theorem that states
$p(x \le a) \le \frac{E(x)}{a}$?


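Exercise 12.15 Compare the Markov and Chebyshev bounds for the following probability
distributions

1. $p(x) = \begin{cases} 1 & x = 1 \\ 0 & \text{otherwise} \end{cases}$

2. $p(x) = \begin{cases} 1/2 & 0 \le x \le 2 \\ 0 & \text{otherwise} \end{cases}$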
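
Exercise 12.16 Let $s$ be the sum of $n$ independent random variables $x_1, x_2, \ldots, x_n$ where
for each $i$

$$x_i = \begin{cases} 0 & \text{Prob } p \\ 1 & \text{Prob } 1-p \end{cases}$$

1. How large must $\delta$ be if we wish to have $\text{Prob}(s < (1-\delta)m) < \varepsilon$?


2. If we wish to have $\text{Prob}(s > (1+\delta)m) < \varepsilon$?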


Exercise 12.17 What is the expected number of flips of a coin until a head is reached?
Assume $p$ is the probability of a head on an individual flip. What is the value if $p = 1/2$?
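
A numerical check can guide the derivation here. The sketch below assumes independent flips, so the first head occurs on flip $k$ with probability $p(1-p)^{k-1}$, and sums a truncated version of the expectation series. The function name `expected_flips` and the cutoff `terms` are illustrative choices; for very small $p$ the cutoff would need to be larger.

```python
def expected_flips(p, terms=500):
    """Partial sum of sum_{k>=1} k * p * (1 - p)**(k - 1),
    the expected index of the first head for head probability p."""
    return sum(k * p * (1 - p) ** (k - 1) for k in range(1, terms + 1))

for p in (0.5, 0.25):
    print(p, expected_flips(p))
```

The printed values suggest the closed form that the exercise asks for; summing the series symbolically confirms it.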


Exercise 12.18 Given the joint probability

P(A,B) | A=0  | A=1
B=0    | 1/16 | 1/8
B=1    | 1/4  | 9/16

1. What is the marginal probability of A? of B? 


2. What is the conditional probability of B given A? 
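
The two parts above reduce to summing rows or columns of the table and then dividing. A minimal sketch using exact fractions follows; the dictionary layout and variable names are illustrative, and the four probabilities are copied from the table in Exercise 12.18.

```python
from fractions import Fraction as F

# joint[(a, b)] = P(A = a, B = b), copied from the table.
joint = {
    (0, 0): F(1, 16), (1, 0): F(1, 8),
    (0, 1): F(1, 4),  (1, 1): F(9, 16),
}
assert sum(joint.values()) == 1  # the table is a valid distribution

# Marginals: sum the joint over the other variable.
marginal_a = {a: sum(p for (x, _), p in joint.items() if x == a) for a in (0, 1)}
marginal_b = {b: sum(p for (_, y), p in joint.items() if y == b) for b in (0, 1)}

# Conditional: P(B = b | A = a) = P(A = a, B = b) / P(A = a), keyed by (b, a).
cond_b_given_a = {
    (b, a): joint[(a, b)] / marginal_a[a] for a in (0, 1) for b in (0, 1)
}

print(marginal_a, marginal_b, cond_b_given_a)
```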
Exercise 12.19 Consider independent random variables $x_1$, $x_2$, and $x_3$, each equal to
zero with probability $\frac{1}{2}$. Let $S = x_1 + x_2 + x_3$ and let $F$ be the event that $S \in \{1, 2\}$.
Conditioning on $F$, the variables $x_1$, $x_2$, and $x_3$ are still each zero with probability $\frac{1}{2}$. Are they
still independent?
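
One way to explore this question is to enumerate the outcomes in $F$. The sketch below assumes, beyond what the exercise states, that each variable takes only the values 0 and 1, so that all eight outcomes are equally likely; the helper `cond_prob` is illustrative.

```python
from fractions import Fraction
from itertools import product

# Assumption: each x_i is 0 or 1 with probability 1/2, so the 8 outcomes
# (x1, x2, x3) are equally likely. Keep those in the event F: S in {1, 2}.
outcomes = [o for o in product((0, 1), repeat=3) if sum(o) in (1, 2)]

def cond_prob(event):
    """P(event | F), by counting equally likely outcomes inside F."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

p_x1_zero = cond_prob(lambda o: o[0] == 0)
p_x2_zero = cond_prob(lambda o: o[1] == 0)
p_both_zero = cond_prob(lambda o: o[0] == 0 and o[1] == 0)

print(p_x1_zero, p_x2_zero, p_both_zero, p_x1_zero * p_x2_zero)
```

Independence would require the joint probability to equal the product of the two conditional marginals, so comparing the last two printed values settles the question under this assumption.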


Exercise 12.20 Consider rolling two dice A and B. What is the probability that the sum 
S will add to nine? What is the probability that the sum will be 9 if the roll of A is 3? 


Exercise 12.21 Write the generating function for the number of ways of producing change
using only pennies, nickels, and dimes. In how many ways can you produce 23 cents?
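
The count asked for is the coefficient of $x^{23}$ in $\frac{1}{(1-x)(1-x^5)(1-x^{10})}$. The sketch below computes that coefficient by multiplying in one factor at a time, truncating at degree 23; the function name `count_change` is illustrative, and coins of the same denomination are treated as indistinguishable.

```python
def count_change(total, coins):
    """Coefficient of x**total in the product over coins of 1 / (1 - x**coin)."""
    coeff = [0] * (total + 1)
    coeff[0] = 1  # the empty product is the constant series 1
    for coin in coins:
        # Multiplying a series by 1 / (1 - x**coin) is the running update
        # new[k] = old[k] + new[k - coin], done in place in increasing k.
        for amount in range(coin, total + 1):
            coeff[amount] += coeff[amount - coin]
    return coeff[total]

print(count_change(23, [1, 5, 10]))
```

The result can be confirmed by hand: fix the number of dimes (0, 1, or 2), count the possible numbers of nickels, and let pennies make up the rest.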


Exercise 12.22 A die has six faces, each face of the die having one of the numbers 1
through 6. The result of a roll of the die is the integer on the top face. Consider two rolls
of the die. In how many ways can an integer be the sum of two rolls of the die?
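
In generating-function terms, one roll corresponds to the polynomial $x + x^2 + \cdots + x^6$, and the number of ways to obtain a sum $s$ from two rolls is the coefficient of $x^s$ in its square. A minimal sketch follows, with the illustrative helper `poly_mul` and with ordered pairs of rolls counted as distinct.

```python
def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (index = exponent)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

die = [0, 1, 1, 1, 1, 1, 1]  # x + x^2 + ... + x^6
two_rolls = poly_mul(die, die)
ways = {s: two_rolls[s] for s in range(2, 13)}

print(ways)
```

Dividing each count by 36 gives the distribution of the sum of two fair dice.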


Exercise 12.23 If $a(x)$ is the generating function for the sequence $a_0, a_1, a_2, \ldots$, for what
sequence is $a(x)(1-x)$ the generating function?


Exercise 12.24 How many ways can one draw n a's and b's with an even number of a's. 
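
A brute-force count for small $n$ can suggest the pattern before deriving it. The sketch below reads the exercise as asking for the number of length-$n$ strings over $\{a, b\}$ with an even number of a's, which is an interpretation rather than something the exercise states; the function name `count_even_a` is illustrative.

```python
from itertools import product

def count_even_a(n):
    """Count length-n strings over {a, b} containing an even number of a's."""
    return sum(1 for s in product("ab", repeat=n) if s.count("a") % 2 == 0)

counts = [count_even_a(n) for n in range(1, 9)]
print(counts)
```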


Exercise 12.25 Find the generating function for the recurrence $a_i = 2a_{i-1} + i$ where
$a_0 = 1$.


Exercise 12.26 Find a closed form for the generating function for the infinite sequence
of perfect squares 1, 4, 9, 16, 25, ...


Exercise 12.27 Given that $\frac{1}{1-x}$ is the generating function for the sequence $1, 1, \ldots$, for
what sequence is $\frac{1}{1-2x}$ the generating function?


Exercise 12.28 Find a closed form for the exponential generating function for the infinite
sequence of perfect squares 1, 4, 9, 16, 25, ...


Exercise 12.29 Prove that the $L_2$ norm of $(a_1, a_2, \ldots, a_n)$ is less than or equal to the $L_1$
norm of $(a_1, a_2, \ldots, a_n)$.


Exercise 12.30 Prove that there exists a $y$, $0 \le y \le x$, such that $f(x) = f(0) + f'(y)x$.


Exercise 12.31 Show that the eigenvectors of a matrix A are not a continuous function 
of changes to the matrix. 


Exercise 12.32 What are the eigenvalues of the two graphs shown below? What does
this say about using eigenvalues to determine if two graphs are isomorphic?


[Figure: two graphs]
Exercise 12.33 Let $A$ be the adjacency matrix of an undirected graph $G$. Prove that
eigenvalue $\lambda_1$ of $A$ is at least the average degree of $G$.


Exercise 12.34 Show that if $A$ is a symmetric matrix and $\lambda_1$ and $\lambda_2$ are distinct
eigenvalues then their corresponding eigenvectors $x_1$ and $x_2$ are orthogonal.
Hint: 


Exercise 12.35 Show that a matrix is rank k if and only if it has k non-zero eigenvalues 
and eigenvalue 0 of rank n-k. 

Exercise 12.36 Prove that maximizing $\frac{x^T A x}{x^T x}$ is equivalent to maximizing $x^T A x$ subject
to the condition that $x$ be of unit length.





Exercise 12.37 Let $A$ be a symmetric matrix with smallest eigenvalue $\lambda_{\min}$. Give a
bound on the largest element of $A^{-1}$.


Exercise 12.38 Let A be the adjacency matrix of an n vertex clique with no self loops. 
Thus, each row of A is all ones except for the diagonal entry which is zero. What is the 
spectrum of A. 
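
A candidate answer can be verified without any eigenvalue routine by checking the eigenvector equation $Av = \lambda v$ directly. The sketch below tests the guess that the all-ones vector has eigenvalue $n-1$ and that each vector $e_0 - e_j$ has eigenvalue $-1$; the helper names are illustrative, and the check is for one fixed $n$ rather than a proof.

```python
def clique_adjacency(n):
    """Adjacency matrix of the n-vertex clique with no self loops."""
    return [[0 if i == j else 1 for j in range(n)] for i in range(n)]

def mat_vec(a, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]

def check_clique_spectrum(n):
    """Return True if the guessed eigenpairs satisfy A v = lambda v."""
    a = clique_adjacency(n)
    ones = [1] * n
    ok = mat_vec(a, ones) == [(n - 1) * x for x in ones]  # eigenvalue n - 1
    for j in range(1, n):
        v = [0] * n
        v[0], v[j] = 1, -1  # e_0 - e_j, orthogonal to the all-ones vector
        ok = ok and mat_vec(a, v) == [-x for x in v]  # eigenvalue -1
    return ok

print(check_clique_spectrum(5))
```

The $n-1$ vectors $e_0 - e_j$ are linearly independent, so together with the all-ones vector they account for all $n$ eigenvalues.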


Exercise 12.39 Let $A$ be the adjacency matrix of an undirected graph $G$. Prove that the
eigenvalue $\lambda_1$ of $A$ is at least the average degree of $G$.


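Exercise 12.40 We are given the probability distribution for two random vectors $x$ and
$y$ and we wish to stretch space to maximize the expected distance between them. Thus,
we will multiply each coordinate by some quantity $a_i$. We restrict $\sum_{i=1}^{d} a_i^2 = d$. Thus, if we
increase some coordinate by $a_i > 1$, some other coordinate must shrink. Given random
vectors $x = (x_1, x_2, \ldots, x_d)$ and $y = (y_1, y_2, \ldots, y_d)$, how should we select the $a_i$ to maximize
$E\left(|x - y|^2\right)$? The $a_i$ stretch different coordinates. Assume

$$y_i = \begin{cases} 0 & \text{with probability } \frac{1}{2} \\ 1 & \text{with probability } \frac{1}{2} \end{cases}$$

and that $x_i$ has some arbitrary distribution.

$$E\left(|x - y|^2\right) = E\sum_{i=1}^{d} \left[a_i^2 (x_i - y_i)^2\right] = \sum_{i=1}^{d} a_i^2 E\left(x_i^2 - 2x_i y_i + y_i^2\right) = \sum_{i=1}^{d} a_i^2 E\left(x_i^2 - x_i + \tfrac{1}{2}\right)$$

Since $E(x_i^2) = E(x_i)$ (as holds when $x_i$ takes only the values 0 and 1) we get
$E\left(|x - y|^2\right) = \frac{1}{2}\sum_{i=1}^{d} a_i^2$. Thus, weighting the coordinates has no effect assuming
$\sum_{i=1}^{d} a_i^2 = 1$. Why is this? Since $E(y_i) = \frac{1}{2}$,
$E\left(|x - y|^2\right)$ is independent of the value of $x_i$ and hence of its distribution.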
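Now suppose instead that $y_i$ is 1 with probability $\frac{1}{4}$ and 0 with probability $\frac{3}{4}$,
so that $E(y_i) = \frac{1}{4}$. Then

$$E\left(|x - y|^2\right) = \sum_{i=1}^{d} a_i^2 E\left(x_i^2 - 2x_i y_i + y_i^2\right) = \sum_{i=1}^{d} a_i^2 E\left(x_i - \tfrac{1}{2} x_i + \tfrac{1}{4}\right) = \sum_{i=1}^{d} a_i^2 \left(\tfrac{1}{2} E(x_i) + \tfrac{1}{4}\right)$$

To maximize, put all weight on the coordinate of $x$ with highest probability of one. What
if we used the 1-norm instead of the two norm?

$$E\left(|x - y|\right) = E\sum_{i=1}^{d} a_i |x_i - y_i| = \sum_{i=1}^{d} a_i E|x_i - y_i| = \sum_{i=1}^{d} a_i b_i$$

where $b_i = E|x_i - y_i|$. If $\sum_{i=1}^{d} a_i^2 = 1$, then to maximize let $a_i$ be proportional to $b_i$,
that is, $a_i = b_i / |b|$. The dot product of $a$ and $b$ is maximized when both are in the same direction.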


Exercise 12.41 Maximize $x + y$ subject to the constraint that $x^2 + y^2 = 1$.
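
A numerical scan gives a quick check on whatever answer the analytic argument produces. The sketch below assumes the parametrization $x = \cos\theta$, $y = \sin\theta$ of the unit circle and evaluates $x + y$ on an evenly spaced grid of angles; the function name and the grid size are illustrative choices.

```python
import math

def maximize_on_unit_circle(steps=3600):
    """Scan x = cos(t), y = sin(t) over a grid of angles and
    return (best value of x + y, x, y) found on that grid."""
    best_value, best_x, best_y = -math.inf, 0.0, 0.0
    for k in range(steps):
        theta = 2 * math.pi * k / steps
        x, y = math.cos(theta), math.sin(theta)
        if x + y > best_value:
            best_value, best_x, best_y = x + y, x, y
    return best_value, best_x, best_y

value, x, y = maximize_on_unit_circle()
print(value, x, y)
```

The maximizer has equal coordinates, which matches the observation in Exercise 12.40 that a dot product with a fixed vector is largest when the two vectors point in the same direction.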




Index 


2-universal, 187 
4-way independence, 194 


Affinity matrix, 232 
Algorithm 
greedy k-clustering, 218 
k-means, 214 
singular value decomposition, 51 
Almost surely, 256 
Anchor Term, 322 
Aperiodic, 78 
Arithmetic mean, 422 


Bad pair, 260 

Bayes rule, 432 

Bayesian, 343 

Bayesian network, 343 

Belief Network, 343 

belief propagation, 342 

Bernoulli trials, 429 

Best fit, 40 

Big Oh, 411

Binomial distribution, 251 
approximated by Poisson, 430 

Boosting, 170 

Branching process, 277 

Breadth-first search, 265 


Cartesian coordinates, 17 
Cauchy-Schwarz inequality, 418, 420
Central Limit Theorem, 428 
Characteristic equation, 442 
Characteristic function, 459 
Chebyshev’s inequality, 13 
Chernoff bounds, 434 
Clustering, 211 
k-center criterion, 218 
k-means, 214 
Social networks, 239 
Sparse Cuts, 232 
CNF
CNF-sat, 284
Cohesion, 235 
Combining expert advice, 167 
Commute time, 105 
Conditional probability, 425 
Conductance, 98 
Coordinates 

Cartesian, 17 

polar, 17 
Coupon collector problem, 108 
Cumulative distribution function, 425 
Current 

probabilistic interpretation, 101 
Cycles, 271 

emergence, 270 

number of, 270 


Data streams 
counting frequent elements, 190 
frequency moments, 185 
frequent element, 191 
majority element, 190 
number of distinct elements, 186 
number of occurrences of an element, 189
second moment, 192 
Degree distribution, 251 
power law, 251 
Depth first search, 267 
Diagonalizable, 442 
Diameter of a graph, 259, 273 
Diameter two, 271 
dilation, 390 
Disappearance of isolated vertices, 271 
Discovery time, 103 
Distance 
total variation, 83 
Distribution 
vertex degree, 249 
Document ranking, 62 
Effective resistance, 106
Eigenvalue, 442 
Eigenvector, 54, 442 
Electrical network, 98 
Erdős Rényi, 248
Error correcting codes, 193 
Escape probability, 102 
Euler’s constant, 109 
Event, 425 
Expander, 91 
Expected degree 
vertex, 248 
Expected value, 426 
Exponential generating function, 458 
Extinct families 
size, 281 
Extinction probability, 277, 279 


Finite fields, 461 

First moment method, 257 
Fourier transform, 375, 459 
Frequency domain, 376 


G(n,p), 248 
Gamma function, 18, 420
Gaussian, 23, 428, 460 
fitting to data, 29 
tail, 424 
Gaussians 
separating, 27
General tail bounds, 437 
Generating function, 277 
component size, 293 
for sum of two variables, 278 
Generating functions, 455 
Generating points in the unit ball, 22 
Geometric mean, 422 
Giant component, 249, 256, 262, 267, 271 
Gibbs sampling, 85 
Graph 
connectivity, 270
resistance, 109 
Graphical model, 342 


Greedy 
k-clustering, 218 

Growth models, 291 
with preferential attachment, 298 
without preferential attachment, 292 


Hölder’s inequality, 418, 421
Haar wavelet, 391 
Harmonic function, 99 
Hash function 

universal, 187 
Heavy tail, 251 
Hidden Markov model, 337 
Hidden structure, 241 
Hitting time, 103, 115 


Immortality probability, 279 
Incoherent, 373, 376 
Increasing property, 256, 275 
unsatisfiability, 285 
Independence 
limited way, 193 
Independent, 425 
Indicator random variable, 260 
of triangle, 254 
Indicator variable, 426 
Ising model, 357 
Isolated vertices, 262, 271 
number of, 262 


Jensen’s inequality, 423 
Johnson-Lindenstrauss lemma, 25, 26 


k-clustering, 218 

k-means clustering algorithm, 214 
Kernel methods, 231 

Kirchhoff’s law, 100 

Kleinberg, 300 


Lagrange, 460 

Laplacian, 71 

Law of large numbers, 12, 14 
Learning, 131 

Linearity of expectation, 254, 426 
Lloyd’s algorithm, 214 
Local algorithm, 300
Long-term probabilities, 81 


m-fold, 275 
Markov chain, 78 
state, 83 
Markov Chain Monte Carlo, 79 
Markov random field, 345 
Markov's inequality, 13 
Matrix 

multiplication 

by sampling, 196 

diagonalizable, 442 

similar, 442 
Maximum cut problem, 64 
Maximum likelihood estimation, 433 
Maximum likelihood estimator, 29 
Maximum principle, 99 
MCMC, 79 
Mean value theorem, 461 
Median, 427 
Metropolis-Hastings algorithm, 84
Mixing time, 81 
Model 
random graph, 248 
Molloy Reed, 290 
Moment generating function, 459 
Mutually independent, 425 





Nearest neighbor problem, 27 
Non-uniform Random Graphs, 289 
Normalized conductance, 81, 90 
Number of triangles in G(n, p), 254 





Ohm's law, 100 
Orthonormal, 449 


Page rank, 114 
personalized, 117
Persistent, 78 
Phase transition, 256 
CNF-sat, 284 
non-finite components, 296 
Poisson distribution, 430 
Polar coordinates, 17 


Polynomial interpolation, 193 
Positive semidefinite, 455 

Power iteration, 62 

Power law distribution, 251 
Power method, 51 

Power-law distribution, 289 
Principal component analysis, 56
Probability density function, 424 
Probability distribution function, 424 
Pseudo random, 194

Pure-literal heuristic, 286 


Queue, 286 
arrival rate, 286 


Radon, 144 

Random graph, 248 

Random projection, 25 
theorem, 25 

Random variable, 424 

Random walk 
Euclidean space, 110
in three dimensions, 111 
in two dimensions, 111 
on lattice, 110 
undirected graph, 103 
web, 113 

Rapid Mixing, 83 

Real spectral theorem, 443 

Replication, 275 

Resistance, 98, 109 
effective, 102

Restart, 114 
value, 114 

Return time, 114 


Sample space, 424 
Sampling 

length squared, 197 
Satisfying assignments 

expected number of, 285 
Scale function, 391 
Scale vector, 391 
Second moment method, 254, 257 
Sharp threshold, 256
Similar matrices, 442 
Singular value decomposition, 40 
Singular vector, 43 

first, 43 

left, 45 

right, 45 

second, 43 
Six-degrees separation, 300 
Sketch 

matrix, 200 
Sketches 

documents, 204 
Small world, 299 
Smallest-clause heuristic, 285 
Spam, 116 
Spectral clustering, 219 
Stanley Milgram, 299 
State, 83 
Stirling approximation, 419 
Streaming model, 184 
Symmetric matrices, 443 


Tail bounds, 434, 437 
Tail of Gaussian, 424 
Taylor series, 414 
Threshold, 255 
CNF-sat, 283 
diameter O(ln n), 274
disappearance of isolated vertices, 262
emergence of cycles, 270
emergence of diameter two, 259
giant component plus isolated vertices, 272
Time domain, 376 
Total variation distance, 83 
Trace, 452 
Triangle inequality, 418 
Triangles, 253 


Union bound, 426 
Unit-clause heuristic, 286 
Unitary matrix, 449 
Unsatisfiability, 285 


Variance, 427 
variational method, 419 
VC-dimension, 141 
convex polygons, 143 
finite sets, 143 
half spaces, 144 
intervals, 143 
pairs of intervals, 143 
rectangles, 143 
spheres, 145 
Viterbi algorithm, 339 
Voltage 
probabilistic interpretation, 100 


Wavelet, 390 
World Wide Web, 113 


Young's inequality, 418, 421 